计算机外文翻译---万维网爬行的有效URL缓存（节选）

资源ID：148817 资源大小：68.50KB 全文页数：21页
资源格式： DOC 下载积分：150金币

快捷下载

账号登录下载

三方登录下载：

下载资源需要150金币

邮箱/手机：
温馨提示：	快捷下载时，用户名和密码都是您填写的邮箱或者手机号，方便查询和重复下载（系统自动生成）。如填写123，账号就是123，密码也是123。
支付方式：
验证码：	换一换

账号：
密码：
验证码：	换一换
当日自动登录忘记密码？

友情提示

1、下载资料失败解决办法

2、PDF文件下载后，可能会被浏览器默认打开，此种情况可以点击浏览器菜单，保存网页到桌面，就可以正常下载了。

3、本站不支持迅雷下载，请使用电脑自带的IE浏览器，或者360浏览器、谷歌浏览器下载即可。

4、本站资源下载后的文档和图纸-无水印,预览文档经过压缩，下载后原文更清晰。

计算机外文翻译---万维网爬行的有效URL缓存（节选）

1、4300单词， 2.1 万英文字符， 5500 汉字出处： Broder A Z, Najork M, Wiener J L. Efficient URL caching for world wide web crawlingC/ 2003:679-689. Efficient URL caching for world wide web crawling Andrei Z. Broder, Marc Najork, Janet L. Wiener ABSTRACT Crawling the web is deceptively simple: the basic algorithm is

2、(a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic

3、and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to

4、store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL c

5、aching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data ext

6、racted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explan

注意事项: 本文（计算机外文翻译---万维网爬行的有效URL缓存（节选））为本站会员（我***）主动上传，毕设资料网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若此文所含内容侵犯了您的版权或隐私，请联系网站客服QQ：540560583，我们立即给予删除！