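Abstract (摘要)

Instant messaging software is now very widely used in daily life, but the vast majority of it must run over the Internet and can only be used in an INTERNET environment. Sometimes colleagues within an organization, or classmates, want to exchange information without the INTERNET, whether because no Internet connection is available or for other reasons, which makes developing a LAN communication tool necessary. This paper sets out the requirements for information exchange over a local area network and introduces and studies the TCP/IP protocol suite in detail, including TCP, UDP, broadcast, and related techniques. It also explains the principles of network information exchange and, on that basis, uses SOCKET network programming to implement LAN information exchange on the WINDOWS platform.

A web crawler is a program that automatically collects information from the Internet. A web crawler can gather web information for search engines, and it can also serve as a targeted information collector that gathers specific information from particular websites, such as job postings or rental listings.

This paper implements, in JAVA, a multithreaded crawler based on a breadth-first algorithm. It discusses why multithreading is used and how it is implemented, how data is stored during system implementation, and how web page information is parsed.

With this crawler, the URLs of a given site can be collected and stored in a database, and the parsed web pages are saved as XML documents.

【Keywords】 web crawler; SOCKET programming; TCP/IP; network programming; JAVA

Abstract

Instant messaging software in our daily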
lives is very widely used. However, most such software can only be used over the Internet and requires an Internet environment. Sometimes staff within an organization, or students, have no Internet access or for other reasons do not wish to communicate over the Internet. This creates a need for a LAN communication program. This paper therefore presents the requirements for software that exchanges information over a local area network, and introduces and studies the TCP/IP protocol suite in detail, including TCP, UDP, broadcast, and related technologies. The principles of network information exchange are also discussed. On this basis, I use Socket network programming to develop a LAN chat application for the Windows platform.

A SPIDER is a program that automatically collects information from the Internet. A SPIDER can collect data for search engines, and it can also act as a directional information collector that gathers specific information from certain websites, such as recruitment or house rental information.

This paper uses JAVA to implement a multi-threaded SPIDER based on a breadth-first algorithm. It discusses several major issues for the SPIDER: why multi-threading is used and how it is implemented, data storage, HTML code parsing, and so on. The SPIDER can collect URLs from a single website and store them in a database.

【KEY WORDS】 SPIDER; JAVA; Socket programmi