1. Foreign-Language Source Material (Original Text)

Efficient URL Caching for World Wide Web Crawling
Marc Najork

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test in step (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests.

Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is
(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)-(c)
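To make the membership test in step (c) concrete, the following is a minimal sketch of this loop in Python. The fetch_page() helper is a hypothetical stand-in (not part of the paper) that fetches a page and returns the URLs linked from it; everything else is just the three steps above with an in-memory "seen" set.

    from collections import deque

    def crawl(seed_urls, fetch_page, max_pages=1000):
        """Minimal crawl loop; fetch_page(url) must return the list of URLs linked from url."""
        seen = set(seed_urls)           # the "seen" URL set consulted by the membership test in (c)
        frontier = deque(seed_urls)     # URLs discovered but not yet fetched
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()    # (a) fetch the next page
            links = fetch_page(url)     # (b) parse it and extract all linked URLs
            fetched += 1
            for link in links:          # (c) keep only URLs not seen before
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen

At web scale, the seen set in this sketch no longer fits in main memory, which is exactly what motivates the distributed architectures and the caching techniques investigated in this paper.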
Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page, or a directory, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes. (See for instance [34].)
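The difference between the two strategies is only the order in which the frontier of unexplored nodes is consumed. As an illustrative sketch (not taken from the paper), the same traversal loop gives BFS when the frontier is used as a FIFO queue and DFS when it is used as a LIFO stack; the neighbors() callback is a hypothetical adjacency function.

    from collections import deque

    def traverse(start, neighbors, breadth_first=True):
        """Generic graph traversal; neighbors(node) returns the nodes adjacent to node."""
        seen = {start}
        frontier = deque([start])
        order = []
        while frontier:
            # FIFO (popleft) explores the oldest frontier node first -> BFS;
            # LIFO (pop) explores the newest frontier node first -> DFS.
            node = frontier.popleft() if breadth_first else frontier.pop()
            order.append(node)
            for nxt in neighbors(node):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return order

    # Example on a small hand-made graph:
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    print(traverse("A", lambda n: graph[n], breadth_first=True))   # ['A', 'B', 'C', 'D']
    print(traverse("A", lambda n: graph[n], breadth_first=False))  # ['A', 'C', 'D', 'B']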
However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
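The two rates quoted above follow from a back-of-the-envelope calculation; the figure of roughly ten links extracted per page used below is an illustrative assumption, not a number given in this section.

    100,000,000 pages/day ÷ 86,400 seconds/day ≈ 1,157 page fetches per second          (step (a))
    1,157 fetches/second × ~10 links extracted per page ≈ 11,600 membership tests per second   (step (c))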
A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries, and show how these caches can be implemented very efficiently.
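As a concrete illustration of how such a cache sits in front of the full membership test, here is a minimal LRU sketch in Python (LRU being one of the four practical policies listed above). The seen_store object with a check_and_add(url) method is a hypothetical interface for the expensive lookup (a disk-based structure or a peer crawler node), and the 50,000 default capacity simply mirrors the cache size discussed in this paper; none of these names come from the paper itself.

    from collections import OrderedDict

    class LRUCache:
        """Fixed-capacity in-memory cache of "seen" URLs; evicts the least recently used entry."""
        def __init__(self, capacity=50000):
            self.capacity = capacity
            self.entries = OrderedDict()          # url -> None, ordered from oldest to newest use

        def contains(self, url):
            if url in self.entries:
                self.entries.move_to_end(url)     # a hit makes the URL the most recently used
                return True
            return False

        def add(self, url):
            self.entries[url] = None
            self.entries.move_to_end(url)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # drop the least recently used URL

    def check_and_mark_seen(url, cache, seen_store):
        """Return True if url was seen before, consulting the cache before the full store."""
        if cache.contains(url):
            return True                               # cache hit: no expensive lookup needed
        already_seen = seen_store.check_and_add(url)  # expensive: on disk or at a peer node
        cache.add(url)                                # either way, the URL is now a "seen" URL
        return already_seen

The LRU policy is used here only for concreteness; the paper compares it against random replacement, static caching, and CLOCK, with clairvoyant caching and an infinite cache serving as theoretical upper bounds.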
The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.