1
2
3
4
5
| wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
|
1
2
3
4
5
6
7
8
9
10
| 训练:
build/bin/lmplz -o 3 --verbose_header --text people2014corpus_words.txt --arpa result/people2014corpus_words.arps
压缩模型以使用:
build/bin/build_binary ./result/people2014corpus_words.arps ./result/people2014corpus_words.klm
其中:
-o n:最高采用n-gram语法
-verbose_header:在生成的文件头位置加上统计信息
--text text_file:指定存放预料的txt文件
--arpa:指定输出的arpa文件
|
1
2
3
| 1. 将C:\Program Files (x86)\Windows Kits\8.1\bin\x86 中的rc.exe、rcdll.dll到C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin
2. pip install https://github.com/kpu/kenlm/archive/master.zip
3. pip install pycorrector
|
1
2
3
4
5
| import pymongo
db = pymongo.MongoClient().weixin.text_articles
for text in db.find(no_cursor_timeout=True).limit(500000):
print ' '.join(text['text']).encode('utf-8')
|
1
2
| python p.py|./kenlm/bin/lmplz -o 4 > weixin.arpa
./kenlm/bin/build_binary weixin.arpa weixin.klm
|
1
2
3
4
5
6
7
| import kenlm
model = kenlm.Model('weixin.klm')
model.score('微 信', bos=False, eos=False)
'''
score函数输出的是对数概率,即log10(p('微 信')),其中字符串可以是gbk,也可以是utf-8
bos=False, eos=False意思是不自动添加句首和句末标记符
'''
|