Choosing a web crawler

Abstract

The goal of this post is to describe several tools for crawling and scraping, so that you can select the one best suited to your needs.

Definitions

Scraping consists of extracting data from a single website.
Crawling, by contrast, gathers data from many websites or even the entire Web.

Scrapy

Fast and easy to use

http://scrapy.org/

This Python project is best suited for scraping specific, known websites. It is simple and easy to use, and it works asynchronously, so it never blocks while waiting for HTTP responses, which would be time-consuming.

This tool is usually used to craft a handmade parser for a known website in order to extract precise information. When the structure is unknown, or when scaling up to lots of different websites, something else is needed. Moreover, Scrapy is not a distributed system, so it is not the best choice for a huge number of websites. A workaround is to run several processes on different machines, each with its own list of websites.
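As an illustration, here is a minimal sketch of a focused spider; the start URL and the CSS selectors are hypothetical placeholders that must be adapted to the target site.

import scrapy


class ArticlesSpider(scrapy.Spider):
    # Hypothetical spider: the URL and CSS selectors below are placeholders.
    name = "articles"
    start_urls = ["https://example.com/articles"]

    def parse(self, response):
        # One item per article block on the page; adapt the selectors to the real site.
        for article in response.css("div.article"):
            yield {
                "title": article.css("h2::text").get(),
                "link": article.css("a::attr(href)").get(),
            }
        # Follow the pagination link, if any; requests are scheduled asynchronously.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Saved as, say, articles_spider.py, it can be run with "scrapy runspider articles_spider.py -o articles.json", which writes the yielded items to a JSON file.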

Common crawl

Shared Web DataBase

http://commoncrawl.org/

This is not a tool but a huge database of crawled websites. It offers three formats: extracted text, metadata (the links contained in each page, etc.) and the full HTML responses. The crawl covers the whole web and the archives are released by month or season; they are not sorted in any other way.
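As a sketch of how the full-HTML archives can be read, here is how one might iterate over a downloaded WARC file with the warcio library; the file name below is a placeholder, the real paths being listed for each crawl on commoncrawl.org.

from warcio.archiveiterator import ArchiveIterator

# Iterate over the response records of one (placeholder-named) Common Crawl WARC file.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            print(url, len(html), "bytes")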

The catch is that the data is so huge that you actually need to crawl Common Crawl itself (!) in order to get the data you want. It may therefore be relevant for people who want to process the full World Wide Web, or English data only, which makes up the biggest part of Common Crawl. But for those interested in a specific subject or language, such as news websites in Chinese, it is probably not the best tool.
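As a hedged sketch, the public index at index.commoncrawl.org can be queried over HTTP to locate the captures of a single domain without downloading a whole crawl; the crawl identifier and the domain below are examples to replace.

import json

import requests

# The crawl id and domain are examples only; available crawls are listed on commoncrawl.org.
CRAWL_ID = "CC-MAIN-2018-13"
index_url = "https://index.commoncrawl.org/" + CRAWL_ID + "-index"

response = requests.get(
    index_url,
    params={"url": "lemonde.fr/*", "output": "json"},
    timeout=30,
)
response.raise_for_status()

# One JSON record per line, pointing to the WARC file and offset where each page is stored.
for line in response.text.splitlines():
    capture = json.loads(line)
    print(capture["url"], capture["filename"], capture["offset"])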

Statistical methods exist to estimate the language or the subject of a website from its content.
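For instance, the language of a page can be estimated from its extracted text with an off-the-shelf statistical detector; the sketch below uses the langdetect package, one possible choice among others.

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make the statistical detection deterministic


def is_chinese(page_text):
    # Return True when the extracted text of a page is detected as Chinese.
    try:
        return detect(page_text).startswith("zh")
    except Exception:
        # Detection fails on empty or non-linguistic content.
        return False


print(is_chinese("这是一条中文新闻。"))                  # True
print(is_chinese("Ceci est un article en français."))  # False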

Nutch

Distributed Tank

http://nutch.apache.org/

This is a large Java project that can be compared to the Googlebot. It is made for crawling the whole web and can be distributed over several machines, for example on AWS. The architecture is pretty heavy and has lots of dependencies (Ant, Gora, HBase, Hadoop, etc.), but it is also very powerful and manages scheduling, domain restriction, politeness, and so on. It can therefore be interesting for crawling a specific, targeted group of websites (e.g., French news websites) every x days or hours.

The heaviest part of Nutch is Hadoop, which is not trivial to install in cluster mode and will require some time.

Nutch can also run without Hadoop, in standalone mode. However, the distributed mode is the most interesting part of Nutch; for a standalone crawl, Scrapy is probably the better choice.

Nutch is not just a distributed scraper. It is very polite and makes sure that only one query per host runs at a time, to avoid being blacklisted. Keep in mind that crawling without any politeness may be considered a DoS attack.
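To give an idea of what politeness means, here is a simplified, single-process sketch of per-host rate limiting; Nutch enforces the same idea across a whole cluster, and the one-second delay is an arbitrary assumption, not Nutch's configuration.

import time
from urllib.parse import urlparse

import requests

CRAWL_DELAY = 1.0   # seconds between two requests to the same host (arbitrary value)
last_fetch = {}     # host -> timestamp of the last request sent to it


def polite_get(url):
    # Wait so that a given host never receives more than one request per CRAWL_DELAY.
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_fetch[host] = time.time()
    return requests.get(url, headers={"User-Agent": "polite-crawler-demo"}, timeout=30)

A real crawler would also honour robots.txt, which Nutch does by default.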

It also includes adaptive fetching capabilities, fetching frequently changing pages more often than "static" pages.
http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
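The idea can be summarised with the toy sketch below: the re-fetch interval shrinks when a page has changed since the last visit and grows when it has not. The factors and bounds are arbitrary assumptions, not Nutch's actual defaults.

# Toy sketch of an adaptive re-fetch interval (not Nutch's actual implementation).
MIN_INTERVAL = 3600.0        # one hour, arbitrary lower bound
MAX_INTERVAL = 30 * 86400.0  # thirty days, arbitrary upper bound


def next_interval(current_interval, page_changed):
    # Visit changing pages sooner, back off on static ones.
    if page_changed:
        new_interval = current_interval * 0.5
    else:
        new_interval = current_interval * 1.5
    return max(MIN_INTERVAL, min(MAX_INTERVAL, new_interval))


# A page that never changes drifts toward the thirty-day bound.
interval = 86400.0
for _ in range(10):
    interval = next_interval(interval, page_changed=False)
print(round(interval / 86400.0, 1), "days between fetches")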

Nutch 2.x is almost a rewrite from scratch and therefore quite different from 1.x. The biggest change is the integration of Apache Gora, which allows various storage backends such as HBase, Cassandra, etc. However, Nutch 2.x is slower and has fewer features than Nutch 1.x.
http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html

Other scraper projects

There are many more scraper projects, mostly written in Python and designed for scraping a specific website or for indexing. None of them, however, is as easy to use as Scrapy, and most have smaller communities.