Abstract: The primary task in constructing a domain ontology is to obtain the relevant domain concepts. Many of these concepts are expressed as domain compound words that are not included in common dictionaries, so extracting domain compound words is essential to ontology construction. This paper proposes a hybrid extraction method that combines linguistic rules with statistical techniques, specifically an improved mutual information measure and language templates. First, high-frequency candidate domain compound words formed from multi-word units are extracted using the improved mutual information algorithm. On this basis, low-frequency candidates are then extracted using language templates. Finally, the domain compound words are confirmed through expert review. Experimental results show that the algorithm achieves a precision of 88.22%, which indicates that the technique is suitable for automatically and effectively extracting domain compound words from large corpora.
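The two automatic stages described above can be illustrated with a minimal sketch. The abstract does not specify the improved mutual information formula or the actual language templates, so this example substitutes standard pointwise mutual information over adjacent word pairs and two hypothetical part-of-speech patterns. The function names, the `min_count` and `threshold` parameters, and the tag labels are all assumptions made for illustration, not the method evaluated in the paper.

```python
import math
from collections import Counter


def pmi_candidates(tokens, min_count=2, threshold=1.0):
    """Score adjacent word pairs with standard pointwise mutual information.

    Assumption: plain PMI stands in for the paper's improved measure.
    Pairs seen fewer than min_count times are skipped, since PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scored = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        score = math.log2(p_xy / (p_x * p_y))
        if score >= threshold:
            scored[(w1, w2)] = score
    return scored


def template_candidates(tagged_tokens,
                        templates=(("NOUN", "NOUN"), ("ADJ", "NOUN"))):
    """Collect adjacent word pairs whose tags match a template.

    Assumption: the templates here are hypothetical part-of-speech
    patterns. Matching ignores frequency, so rare pairs are kept.
    """
    found = set()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in templates:
            found.add((w1, w2))
    return found
```

In this sketch the statistical stage only sees pairs that recur in the corpus, while the template stage can recover pairs that occur once, which mirrors the high-frequency and low-frequency split in the abstract. The output of both stages would still be a candidate list for expert review.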