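When building related-article recommendations, we need to fetch related articles by keyword, which improves the relevance of the page. To get there, the next step is to extract keywords from each article.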
There are two ways to extract keywords:
- Read the article and identify the keywords manually
- Extract the keywords with a program
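If there are many articles, the programmatic approach is more convenient. Python has a module called nltk that provides methods for extracting the core words of a page.
Here is the code: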
import nltk
from nltk.probability import FreqDist
binfo = '''
shibang Mibile Crusher offers optimum set-up flexibility, from coarse to fine bashing, and is cost efficient. The application improves working safety, reduces the need for quarry highway maintenance, and gives coal minng significantly better access to material information. A further benefit is this : waste material can be separated on-site. shibang Mibile Crusher can be arranged to provide a two-stage crushing and evaluating system, as a three-stage abrasive, secondary and tertiary mashing and screening method, or as three independent units. Getting crushing and evaluating process on small wheels really boosts approach efficiency. shibang Mibile Crusher can be used for all mobile mashing applications, opening up clients opportunities. shibang Collection Cell Crusher is wholly adaptable to all cellular crushing needs.The item sets up a new variety of business opprtunities for trades-people,quarry operators, recycling along with mining applications.
'''
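# Requires NLTK's tokenizer and POS-tagger data (fetch once with nltk.download; resource names vary by NLTK version)
textinfo = nltk.word_tokenize(binfo)  # tokenize
tagged = nltk.pos_tag(textinfo)  # part-of-speech tags
fdist1 = FreqDist(textinfo)  # frequency of each token (case-sensitive)
minfo = dict(fdist1)
info = list(set([k.lower() for k, v in tagged if v == 'NN']))  # words tagged NN (singular common nouns), lowercased
# A lowercased noun that only occurs capitalized in the text is missing from minfo, so its count defaults to 0
kinfo = [(k, minfo.get(k, 0)) for k in info]
kinfo.sort(key=lambda k: k[1], reverse=True)
print(",".join([m[0] for m in kinfo[:5]]))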
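The output is: shibang,crushing,material,mashing,process
These are the five most frequent words tagged as nouns (NN). Words with the same count appear in arbitrary order, so the tail of the list may vary between runs.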
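The counting step can also be written without NLTK, which makes it easier to test. The following is a minimal sketch, not the code above: the function name `top_nouns`, its parameters, and the hand-written sample are assumptions for illustration. It takes `(word, tag)` pairs in the shape that `nltk.pos_tag` returns, counts only the occurrences tagged as nouns (the code above counts every occurrence of the token instead), normalizes case before counting, and breaks ties alphabetically so the result is repeatable.

```python
from collections import Counter


def top_nouns(tagged, n=5, noun_tags=("NN", "NNS")):
    """Return the n most frequent nouns from (word, tag) pairs.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Lowercase before counting so "Crusher" and "crusher" share one count
    counts = Counter(word.lower() for word, tag in tagged if tag in noun_tags)
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Hand-written sample in the shape nltk.pos_tag returns (tags are illustrative)
sample = [
    ("Crusher", "NN"), ("offers", "VBZ"), ("crusher", "NN"),
    ("material", "NN"), ("quarry", "NN"), ("material", "NN"),
    ("crusher", "NN"),
]
print(top_nouns(sample, n=2))  # ['crusher', 'material']
```

With real text, `tagged` would come from `nltk.pos_tag(nltk.word_tokenize(text))`.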
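Search engines may also rely on high-frequency words when judging what an article is mainly about, much like tags. So we can use these core keywords to find related terms.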
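Once each article has a keyword list, those keywords can drive the related-article lookup mentioned at the start. The sketch below is one possible approach under stated assumptions, not something the code above does: it ranks articles by Jaccard overlap between keyword sets. The function name `related_articles`, the `articles` dictionary, and its titles and keywords are all made up for illustration.

```python
def related_articles(target_keywords, articles, limit=3):
    """Rank articles by Jaccard overlap with the target keyword set.

    Hypothetical helper: `articles` maps a title to its keyword list.
    """
    target = set(target_keywords)
    scored = []
    for title, keywords in articles.items():
        other = set(keywords)
        union = target | other
        # Jaccard similarity: shared keywords divided by all distinct keywords
        score = len(target & other) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties broken by title for a stable order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:limit]]


# Made-up article index for illustration
articles = {
    "Mobile crusher maintenance": ["crusher", "maintenance", "quarry"],
    "Screening methods": ["screening", "material", "process"],
    "Office news": ["holiday", "party"],
}
print(related_articles(["crusher", "material", "process"], articles))
# ['Screening methods', 'Mobile crusher maintenance']
```

Articles that share no keyword with the target are dropped, so an unrelated page never shows up just to fill the list.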