Research on an Incremental Crawler for Agricultural News Data Sources
Yang Guangzhao (杨广召)
(Tarim University)
Abstract:
As the volume of agricultural news data continues to grow, incremental crawlers focused on agricultural topics have become an important means of collecting agricultural information. An incremental crawler fetches only the content that has been added or updated since the previous crawl and discards content it has already crawled. Based on the characteristics of agricultural news data, this paper proposes an incremental deduplication method for agricultural news that uses a Redis-based Bloom filter, which avoids the problem of very large persistence files exhausting memory. Experiments show that, as the amount of crawled agricultural information grows, the method keeps memory usage under control while effectively improving the efficiency of incremental crawling, and that it has good practical value for incremental information crawling.
Key words:  incremental crawler  agricultural news  deduplication
Received: 2020-08-02  Revised: 2020-08-02
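
The abstract names a Redis-based Bloom filter for incremental deduplication but gives no implementation details, so the following is only a minimal sketch of the general technique, not the author's method. All names (`BloomFilter`, `InMemoryBits`, `filter_new`, the key `news:seen`) and the parameters (bit-array size, number of hash functions, SHA-256 double hashing) are assumptions chosen for illustration. The filter talks to its store only through `setbit` and `getbit`, which mirror the Redis SETBIT and GETBIT commands; an in-memory stand-in is used here so the example runs without a Redis server.

```python
import hashlib


class InMemoryBits:
    """Hypothetical stand-in for a Redis connection.

    It offers only setbit/getbit, with the same argument order as the
    Redis SETBIT and GETBIT commands, so the sketch runs without a server.
    """

    def __init__(self):
        self._bits = {}

    def setbit(self, name, offset, value):
        previous = self._bits.get((name, offset), 0)
        self._bits[(name, offset)] = value
        return previous

    def getbit(self, name, offset):
        return self._bits.get((name, offset), 0)


class BloomFilter:
    """Bloom filter stored as one bitmap under a single key (assumed layout)."""

    def __init__(self, client, key="news:seen", num_bits=1 << 20, num_hashes=5):
        self.client = client
        self.key = key
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _offsets(self, item):
        # Double hashing: derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for offset in self._offsets(item):
            self.client.setbit(self.key, offset, 1)

    def __contains__(self, item):
        # True means "probably seen" (false positives are possible);
        # False means "definitely not seen".
        return all(self.client.getbit(self.key, offset) for offset in self._offsets(item))


def filter_new(urls, seen):
    """Return the URLs not yet recorded in the filter, recording them as we go."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

On a second crawl pass, `filter_new` would return only the URLs the filter has not recorded, which is the incremental behavior the abstract describes. Because memory use is fixed by `num_bits` rather than by the number of stored URLs, the filter trades a small false-positive rate (a new article occasionally skipped) for bounded memory. A real Redis client exposing `setbit(name, offset, value)` and `getbit(name, offset)` could presumably be passed in place of `InMemoryBits`.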
