A Method for Extracting Domain Compound Words
Funding:

National 973 Program of China (2012CB316303); National Natural Science Foundation of China (60933005)

    Abstract:

    The first task in constructing a domain ontology is to obtain domain-relevant concepts. Many of these concepts are domain compound words that are not included in common dictionaries, so extracting domain compound words is essential to ontology construction. Drawing on linguistic rules and statistical techniques, this paper proposes a hybrid extraction method that combines an improved mutual information measure with language templates. First, the improved mutual information algorithm extracts high-frequency candidate domain compound words formed from multi-word units. On that basis, language template matching extracts low-frequency candidates. Finally, experts review the candidates to produce the final set of domain compound words. Experimental results show that the method achieves an extraction precision of 88.22%, making it suitable for automatically and efficiently extracting domain compound words from large-scale web page text.
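    The two automatic stages described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the improved mutual information formula or the actual language templates, so this example substitutes standard pointwise mutual information (PMI) over adjacent word pairs and two hypothetical part-of-speech templates. The function names, thresholds, tags, and toy sentences are all assumptions made for illustration, and the final expert review is left out.

```python
import math
from collections import Counter


def pmi_candidates(sentences, min_count=2, min_pmi=1.0):
    """Stage 1 (stand-in): frequent adjacent pairs scored with standard PMI.

    The paper's *improved* mutual information is not given in the abstract;
    plain PMI is used here as an assumption.
    """
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scored = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # low-frequency pairs are left to the template stage
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= min_pmi:
            scored[(a, b)] = pmi
    return scored


def template_candidates(tagged_sentences, templates, known):
    """Stage 2 (stand-in): pairs whose POS tags match a template and that
    stage 1 did not already select."""
    found = set()
    for sent in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if (t1, t2) in templates and (w1, w2) not in known:
                found.add((w1, w2))
    return found


# Toy, pre-tagged corpus (hypothetical tags: n = noun, v = verb, a = adjective).
tagged = [
    [("domain", "n"), ("ontology", "n"), ("is", "v"), ("useful", "a")],
    [("build", "v"), ("domain", "n"), ("ontology", "n")],
    [("mutual", "a"), ("information", "n"), ("helps", "v")],
]
sentences = [[word for word, _ in sent] for sent in tagged]

high = pmi_candidates(sentences)                       # high-frequency candidates
templates = {("n", "n"), ("a", "n")}                   # hypothetical templates
low = template_candidates(tagged, templates, set(high))  # low-frequency candidates
candidates = set(high) | low                           # would go to expert review
```

    In this sketch the frequency threshold decides which stage handles a pair: pairs seen at least `min_count` times are scored statistically, while rarer pairs can only be recovered through a template match. The combined `candidates` set corresponds to what the abstract describes as being handed to experts for verification.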

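Cite this article

LIU Jian (刘剑). A method of domain compound words extraction[J]. Journal of Terahertz Science and Electronic Information Technology (太赫兹科学与电子信息学报), 2014, 12(6): 870-873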

History
  • Received: 2013-12-11
  • Last revised: 2014-03-17
  • Published online: 2015-01-05