1. Original Foreign-Language Material

Efficient URL Caching for World Wide Web Crawling

Andrei Z. Broder, IBM TJ Watson Research Center, 19 Skyline Dr, Hawthorne, NY 10532
Marc Najork, Microsoft Research, 1065 La Avenida, Mountain View, CA 94043
Janet L. Wiener, Hewlett Packard Labs, 1501 Page Mill Road, Palo Alto, CA 94304

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a) through (c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test in (c) must be done well over ten thousand times per second, against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms
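The loop in steps (a) through (c), with an in-memory cache screening the membership test before the full (too-large-for-memory) set is consulted, might be sketched as follows. This is only an illustration under assumptions not in the abstract: `fetch`, `parse`, and `seen_on_disk` are hypothetical stand-ins for the crawler's components, and LRU is just one of several replacement policies the paper evaluates.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity LRU cache of 'seen' URLs (one possible policy)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def contains(self, url):
        if url in self._store:
            self._store.move_to_end(url)  # mark as most recently used
            return True
        return False

    def add(self, url):
        self._store[url] = True
        self._store.move_to_end(url)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

def crawl(seeds, fetch, parse, cache, seen_on_disk):
    """Steps (a)-(c) with a cached membership test.

    `fetch`/`parse` are stand-ins for page download and link extraction;
    `seen_on_disk` stands in for the full (disk- or cluster-resident) seen set.
    """
    frontier = list(seeds)
    seen_on_disk.update(seeds)
    visited = []
    while frontier:
        url = frontier.pop()
        visited.append(url)          # step (a): fetch the page
        for link in parse(fetch(url)):   # step (b): extract linked URLs
            # step (c): membership test -- cheap cache probe first,
            # expensive full-set lookup only on a cache miss
            if cache.contains(link):
                continue
            cache.add(link)
            if link not in seen_on_disk:
                seen_on_disk.add(link)
                frontier.append(link)
    return visited
```

A cache hit avoids the full-set lookup entirely, which is the point of the paper: if a small main-memory cache absorbs most probes, the per-URL cost of the membership test drops by orders of magnitude.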