Computer Science Foreign-Literature Translation: Efficient URL Caching for Web Crawling
Original foreign-language text

Efficient URL Caching for World Wide Web Crawling

Andrei Z. Broder
IBM TJ Watson Research Center
19 Skyline Dr
Hawthorne, NY 10532

Marc Najork
Microsoft Research
1065 La Avenida
Mountain View, CA 94043

Janet L. Wiener
Hewlett Packard Labs
1501 Page Mill Road
Palo Alto, CA 94304

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests.

Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the
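The membership test in step (c) of the abstract, with an LRU cache placed in front of the full "seen" set, can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `LRUUrlCache`, `filter_new_urls`, and `seen_store` are hypothetical, and an in-memory Python `set` stands in for the large disk-resident or distributed set the paper has in mind.

```python
from collections import OrderedDict


class LRUUrlCache:
    """Fixed-capacity LRU cache of URLs already seen (hypothetical helper)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries = OrderedDict()  # keys ordered oldest -> most recent
        self.hits = 0
        self.misses = 0

    def seen(self, url):
        """Return True on a cache hit; on a miss, insert the URL and return False."""
        if url in self._entries:
            self._entries.move_to_end(url)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self._entries[url] = None
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used URL
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def filter_new_urls(urls, cache, seen_store):
    """Yield URLs not seen before, consulting the cache before the full set."""
    for url in urls:
        if cache.seen(url):
            continue  # cache hit: the URL was seen, so skip the expensive lookup
        if url in seen_store:  # assumption: stands in for the slow full-set lookup
            continue
        seen_store.add(url)
        yield url


if __name__ == "__main__":
    cache = LRUUrlCache(capacity=2)
    store = set()
    stream = ["a", "b", "a", "c", "b", "a"]
    print(list(filter_new_urls(stream, cache, store)))
    print(cache.hit_rate())
```

Only cache misses reach the full set, so the hit rate directly determines how many expensive lookups are avoided. Replaying a real crawl log through such a cache at several capacities is one way to observe the kind of hit-rate curve the abstract describes.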
