Information Extraction System: Design and Its Commercial Use

After more than 30 years of development, the Information Extraction (IE) system has become a focus of attention for researchers in information retrieval, database systems, and natural language processing. The reason is simple. A traditional search engine can only return documents that appear relevant, and users must read them themselves to check whether they meet their requirements, whereas an IE system gives the facts directly to the user. Under the pressure of the information explosion, it is impossible to read all the documents, news items, and Internet pages to locate the facts of interest, yet those facts matter to users who have decisions to make. An IE system helps them achieve the goal of knowing everything that has happened in the domain they are interested in.

An IE system filters large volumes of documents, written in natural human language or in semi-structured formats, to obtain useful descriptions of the facts that interest its users. Shallow parsing, applied to the semantic analysis of sentences, is the most popular method: most IE systems use it to capture the information points in the text. Then, after post-processing such as co-reference analysis or removal of repeated facts, all the filtered information points are put into
slots of a well-structured template. The filled templates together form the output database, which can be read by people or passed to an automatic analysis program for processing.

In the course of our research on IE systems, we learned how important it is to have sufficient technical backing. Capability in natural language processing and in large-scale software development, together with substantial knowledge bases such as syntactic and semantic dictionaries and treebank corpora, are necessary parts of that backing. Today, the main target of IE research is a practical, large-scale IE system that can easily be ported to new domains.

This thesis first describes the development of a prototype of a universal Information Extraction system. Its main features are: 1) shallow parsing with sentence semantic
templates to extract the information in the text; 2) easy modular expansion or exchange; 3) multi-threaded programming to improve processing speed; 4) Chinese as its working language.

Second, the thesis discusses how IE systems can improve an enterprise's use of information. Today, many large Chinese enterprises use MIS, MRP-II, or ERP systems to improve their business management. Knowledge management technology is an important aid to enterprise development, and an IE system can become the main part of a knowledge management system.

The last part of this thesis concerns how to build a proper business model for a practical IE system. With the fast development of the Internet, basic information management technology such as the search engine has become the foundation of business empires, of which Yahoo is the most famous. Can IE technology do the same? My view is that the IE system is a very useful and important information management tool in business, but that for now it cannot serve as a search engine for the general public on the Internet. It can only run behind the scenes and be controlled by human experts
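to collect information from the Internet. The collected information must be post-processed before it can be provided to the end user.

Information Extraction Systems in the Context of Enterprise Informatization

Information extraction systems have existed for more than 30 years. They have received particular attention in the last 10 years and have become an active area in the development of information retrieval systems, database systems, and natural language processing systems. The reason is simple: traditional information retrieval can only give users results based on relevance, and users must still judge for themselves whether those results are useful. Facing the dual pressure of rapidly growing information volume in the network era and users' demand for precise, useful, direct information, information service providers must master more effective means of analyzing information and extracting the important or key information from massive volumes of data. The essence of an information extraction system is to filter large amounts of unstructured or semi-structured text, use shallow analysis techniques from natural language processing to capture the useful information points in the text, and fill them into so-called information slots. After post-processing such as co-reference resolution and elimination of duplicate information, the system strips out the content that is irrelevant to a particular problem from natural language text that is too voluminous for people to read and analyze in full, and converts the useful content into structured data that is easy for people or machines to analyze.

Research on information extraction systems requires substantial research capability and technical reserves in natural and formal languages, as well as capability in database systems and large-scale software engineering. Large-scale, general-purpose information extraction systems with high domain portability are the direction in which the research is heading. In the work for this thesis, the author's research group attempted to build a development prototype of such an information system. Its distinguishing features are that it extracts information using a shallow syntactic and semantic analysis system together with syntactic templates, and that it exploits the parallel processing capability of modern computer systems, using multi-threaded programming to speed up the information extraction system. Information extraction systems have become a necessary and indispensable aid to modern intelligence analysis. They are especially widely used abroad, where labor is expensive; simplified information extraction modules based on natural language analysis are even embedded in the data entry modules of databases to speed up data entry for information systems. Information service companies that use information extraction technology to build databases and to supply users with topic-specific intelligence and analysis results are flourishing; some can already provide useful economic analysis data to commercial and industrial users and have begun to turn substantial profits.

In the informatization of Chinese enterprises, once MIS and ERP software has been used to computerize finance and the whole process of development, procurement, production, and sales, the knowledge management system becomes the information technology that does most to drive the enterprise's development. Effective intelligence analysis, in turn, is a pillar of both development strategy and operational decision-making. Information extraction technology is fully capable of becoming the core software supporting both of these aspects of enterprise informatization. Exploring how to integrate information extraction technology smoothly into enterprise information systems and how to put information extraction systems on a commercial footing, including business models and business practices, so that system development rests on a more solid practical basis, is also one of the research goals of this project. Yahoo's experience in system development and business operation shows clearly that commercial operation is a key factor in the success of technology development. In the Internet era, the network is one of the most important information sources for an information extraction system, and the system needs the Internet in order to operate. Integration with the network is the path information extraction systems must take to succeed. In addition, in certain fields, such as military intelligence analysis or analysis support, information extraction technology can likewise show great potential.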
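
The pipeline described in this abstract (capture information points, fill template slots, remove repeated facts, process documents in parallel threads) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the thesis prototype: the `AcquisitionTemplate` slots, the `extract` and `run_pipeline` names, and the English sentence pattern are hypothetical, and a regular expression stands in for the shallow parser, whereas the prototype works on Chinese text. In CPython, threads mainly help when documents are fetched over I/O, so the thread pool here shows the structure only and makes no claim about speed.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical sentence template: "<buyer> acquired <target> for <price>".
# A regular expression stands in for the shallow parser in this sketch.
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+) "
    r"for (?P<price>\$\d+ (?:million|billion))"
)


@dataclass(frozen=True)  # frozen makes instances hashable, so duplicates can be detected
class AcquisitionTemplate:
    buyer: str
    target: str
    price: str


def extract(document: str) -> list[AcquisitionTemplate]:
    """Fill one template for every match of the sentence pattern in a document."""
    return [AcquisitionTemplate(**m.groupdict()) for m in ACQUISITION.finditer(document)]


def run_pipeline(documents: list[str], workers: int = 4) -> list[AcquisitionTemplate]:
    """Extract from each document in a worker thread, then drop repeated facts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_document = list(pool.map(extract, documents))  # keeps input order

    seen = set()
    database = []
    for templates in per_document:
        for template in templates:
            if template not in seen:  # post-processing: remove repeated facts
                seen.add(template)
                database.append(template)
    return database
```

Calling `run_pipeline` on a list of document strings returns the filled templates in document order, with a fact that is reported in several documents kept only once. Co-reference resolution, which the abstract also lists as a post-processing step, is not covered by this sketch.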