1. Original Foreign-Language Material

Efficient URL Caching for World Wide Web Crawling

Andrei Z. Broder, IBM TJ Watson Research Center, 19 Skyline Dr, Hawthorne, NY 10532
Marc Najork, Microsoft Research, 1065 La Avenida, Mountain View, CA 94043
Janet L. Wiener, Hewlett Packard Labs, 1501 Page Mill Road, Palo Alto, CA 94304

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a) through (c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test in (c) must be done well over ten thousand times per second, against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms
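The loop in steps (a) through (c), with an in-memory cache screening the membership test before the full (too-large-for-memory) set is consulted, might be sketched as follows. This is only an illustration under assumptions not in the abstract: `fetch`, `parse`, and `seen_on_disk` are hypothetical stand-ins for the crawler's components, and LRU is just one of several replacement policies the paper evaluates.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity LRU cache of 'seen' URLs (one possible policy)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def contains(self, url):
        if url in self._store:
            self._store.move_to_end(url)  # mark as most recently used
            return True
        return False

    def add(self, url):
        self._store[url] = True
        self._store.move_to_end(url)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

def crawl(seeds, fetch, parse, cache, seen_on_disk):
    """Steps (a)-(c) with a cached membership test.

    `fetch`/`parse` are stand-ins for page download and link extraction;
    `seen_on_disk` stands in for the full (disk- or cluster-resident) seen set.
    """
    frontier = list(seeds)
    seen_on_disk.update(seeds)
    visited = []
    while frontier:
        url = frontier.pop()
        visited.append(url)          # step (a): fetch the page
        for link in parse(fetch(url)):   # step (b): extract linked URLs
            # step (c): membership test -- cheap cache probe first,
            # expensive full-set lookup only on a cache miss
            if cache.contains(link):
                continue
            cache.add(link)
            if link not in seen_on_disk:
                seen_on_disk.add(link)
                frontier.append(link)
    return visited
```

A cache hit avoids the full-set lookup entirely, which is the point of the paper: if a small main-memory cache absorbs most probes, the per-URL cost of the membership test drops by orders of magnitude.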