计算机外文翻译---万维网爬行的有效URL缓存（节选）

上传人：我***

文档编号：148817

上传时间：2017-05-16

格式：DOC

页数：21

大小：68.50KB

《计算机外文翻译---万维网爬行的有效URL缓存（节选）》由会员分享，可在线阅读，更多相关《计算机外文翻译---万维网爬行的有效URL缓存（节选）（21页珍藏版）》请在毕设资料网上搜索。

1、4300单词， 2.1 万英文字符， 5500 汉字出处： Broder A Z, Najork M, Wiener J L. Efficient URL caching for world wide web crawlingC/ 2003:679-689. Efficient URL caching for world wide web crawling Andrei Z. Broder, Marc Najork, Janet L. Wiener ABSTRACT Crawling the web is deceptively simple: the basic algorithm is

2、(a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic

3、and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to

4、store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL c

5、aching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data ext

6、racted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explan