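Research on Text-Based Clustering Algorithms

Abstract

Clustering is an important method of knowledge discovery. It is widely combined with Chinese information processing technology and applied to network information processing so that users can quickly obtain the information resources they need from the Internet. Text clustering is an effective application of clustering to text mining: based on the different features of the text data and the similarity between texts, it divides the texts into different clusters. The goal is to make the similarity between texts in the same class as large as possible and the similarity between texts in different classes as small as possible. The whole clustering process requires no guidance and no prior knowledge of the data structure, so it is a typical form of unsupervised classification.

This thesis first introduces the techniques related to text clustering, including the text clustering process, text representation models, similarity computation, and common clustering algorithms. It focuses mainly on two clustering methods, the k-means and SOM algorithms,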
presenting the basic ideas and implementation steps of both algorithms and analyzing their clustering results. Improved variants of both algorithms are also introduced.

Keywords: text clustering; clustering methods; K-means; SOM

Abstract

Clustering is an important method of knowledge discovery. Combined extensively with Chinese information processing technology, it is used in network information processing so that users can quickly obtain the information resources they need from the Internet. Text clustering is an effective application of clustering to text mining: according to the different characteristics of the text data and the similarity between texts, it divides the texts into different clusters. The aim
is to make the similarity between texts in the same class as large as possible and the similarity between texts in different classes as small as possible. The clustering process requires no guidance and no prior knowledge of the data structure, so it is a typical form of unsupervised classification. This paper studies the factors that influence the effectiveness of text clustering, including text representation models such as the Boolean model, the vector space model, the probabilistic retrieval model, and the language model. Also studied t