How to install Nutch on an AWS EC2 Cluster

To install Nutch on an Amazon EC2 cluster, you need a good understanding of Nutch and Hadoop; providing that understanding is the goal of this post.

We decided to create this tutorial because we ran into many basic problems for which there was no clear solution documented on the web.

Version

Here we explain how to install Nutch 1.9 on Debian Wheezy; there is no guarantee that the following instructions will work with a different setup. Be very careful about getting the right versions, otherwise you may run into lots of incomprehensible issues.

Note: this tutorial works with Nutch 1.10 as well.

Goal

In our case we are interested in directly getting the raw HTML of the pages of crawled web sites. We do not need page indexing. Our choices have been made with this goal in mind; you will have to decide whether they are the best for your case.

Why Nutch?

If you are interested in why we chose Nutch instead of another crawler / scraper, you can read our post: Choosing a Web Crawler.

Why Nutch 1.9 instead of 2.x?

http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html

Nutch 2.x is a rewrite from scratch. The biggest change is the integration of Apache Gora, which allows Nutch to store its data in several databases such as HBase, Cassandra, etc. However, Nutch 2.x is slower and has fewer features than Nutch 1.x.
This is why we chose 1.x.
Moreover, since our only goal is to get the HTML code, we do not store it in a database; we prefer to store it directly in files.

Install Nutch 1.9

Java 7

First, you will need to install Java 7:
http://www.webupd8.org/2012/06/how-to-install-oracle-java-7-in-debian.html

Nutch 1.9

Go to http://wiki.apache.org/nutch/NutchTutorial for the installation.

If you would like to implement specific behavior in Nutch, such as custom parsing, it is easier to modify the source code than to write a plugin. On the other hand, you will lose the ability to reuse it with other Nutch versions.

For all the instructions that follow, do not forget to replace the environment variables with your own values, especially in the XML snippets.

If you decided to install from source, do not forget to install Ant and to specify your plugin folder path in $NUTCH_HOME/conf/nutch-site.xml:

<property><name>plugin.folders</name><value>$NUTCH_HOME/build/plugins</value></property>

Define the temporary folder in $NUTCH_HOME/conf/nutch-default.xml:

<property><name>mapred.temp.dir</name><value>/tmp</value></property>

It is also possible to configure adaptive fetching, which recrawls pages that change regularly more often.
http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
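
A minimal sketch of such a configuration in $NUTCH_HOME/conf/nutch-site.xml; the property names come from nutch-default.xml, and the interval values (in seconds) are arbitrary examples:

<property><name>db.fetch.schedule.class</name><value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value></property>
<property><name>db.fetch.schedule.adaptive.min_interval</name><value>3600</value></property>
<property><name>db.fetch.schedule.adaptive.max_interval</name><value>2592000</value></property>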

You can define URL filters in $NUTCH_HOME/conf/regex-urlfilter.txt in order to crawl only specific websites.
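
For example, to restrict the crawl to a single (hypothetical) domain, replace the final catch-all +. rule of regex-urlfilter.txt with something like:

# accept pages from example.com and its subdomains
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.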

To manage the starting URLs, create one or more files with one URL per line in a directory that you give to Nutch as the seed directory.
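
A minimal example, with a hypothetical seed URL:

mkdir -p urls
echo "http://example.com/" > urls/seed.txt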

If you would like to get the best performance from your crawler, you can add the following parameters to $NUTCH_HOME/conf/nutch-default.xml, but beware that this removes the politeness that Nutch offers!

<property><name>fetcher.server.delay</name><value>0.1</value></property>
<property><name>fetcher.threads.fetch</name><value>100</value></property>
<property><name>fetcher.threads.per.queue</name><value>100</value></property>

Then edit $NUTCH_HOME/src/bin/crawl and change numSlaves=1 at line 54 to your real number of slaves. You can also increase numTasks at line 59 from * 2 to * 10 and numThreads at line 68 from 50 to 100.
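
A sketch of the first and last of those edits with sed, assuming a hypothetical cluster of 4 slaves (double-check that the variable names match your copy of the script; numTasks is easier to change by hand):

# set the number of slave nodes in the crawl script
sed -i 's/^numSlaves=1/numSlaves=4/' $NUTCH_HOME/src/bin/crawl
# raise the number of fetcher threads
sed -i 's/^numThreads=50/numThreads=100/' $NUTCH_HOME/src/bin/crawl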

If your goal is like ours, i.e. to get only the HTML or text without indexing it, you can disable Solr. To do so, comment out the Solr-related operations in the crawl script (Link inversion, Indexing on Solr and Cleanup on Solr) and skip the Solr installation part.

Then compile

ant runtime

Solr 3.4

Be sure to install Solr 3.4, not the 4.6 release used in the following link.

https://pacoup.com/2014/02/05/install-solr-4-6-with-tomcat-7-on-debian-7/

The following tutorial explains how to set up Solr for Nutch.
http://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search

You will also need to change all occurrences of the default field df from text to content in solrconfig.xml:

sed -i 's#<str name="df">text</str>#<str name="df">content</str>#g' $SOLR_HOME/example/solr/conf/solrconfig.xml

Copy the Solr jar into the Nutch lib as follows:

cp $SOLR_HOME/dist/apache-solr-solrj-3.4.0.jar $NUTCH_HOME/lib/.

Then compile Nutch from $NUTCH_HOME

ant runtime

Hadoop 1.2.1

Refer to the following pages for the installation; the configuration is explained below.
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
https://wiki.apache.org/nutch/NutchHadoopTutorial

Generate an SSH key pair:

ssh-keygen -t dsa -P '' -f id_dsa
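
Then authorize the key for password-less SSH; a sketch, assuming you keep the key in the default ~/.ssh location:

mv id_dsa id_dsa.pub ~/.ssh/
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys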

Edit $HADOOP_PREFIX/conf/hadoop-env.sh to change JAVA_HOME and add:

export HADOOP_PREFIX=$HOME/hadoop-1.2.1
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Edit $HADOOP_PREFIX/conf/core-site.xml to add:

<property><name>fs.default.name</name><value>hdfs://$MASTER_IP:9000</value></property>
<property><name>dfs.permissions</name><value>false</value></property>
<property><name>hadoop.tmp.dir</name><value>/tmp</value></property>

If you are working with Amazon S3, you will also need to define the following in $HADOOP_PREFIX/conf/core-site.xml (if you are working with S3n, just replace s3 with s3n in the property names):

<property><name>fs.s3.awsAccessKeyId</name><value>$MY_ACCESS_KEY</value></property>
<property><name>fs.s3.awsSecretAccessKey</name><value>$MY_SECRET_KEY</value></property>

Edit $HADOOP_CONF_DIR/hdfs-site.xml to add:

<property><name>dfs.http.address</name><value>$MASTER_IP:50070</value></property>
<property><name>dfs.name.dir</name><value>$HADOOP_PREFIX/dfs/name</value><final>true</final></property>
<property><name>dfs.data.dir</name><value>$HADOOP_PREFIX/dfs/data</value><final>true</final></property>
<property><name>dfs.replication</name><value>2</value></property>

Edit $HADOOP_CONF_DIR/mapred-site.xml to add:

<property><name>mapred.job.tracker</name><value>$MASTER_IP:9001</value></property>
<property><name>mapred.system.dir</name><value>$NUTCH_HOME/filesystem/mapreduce/system</value></property>
<property><name>mapred.local.dir</name><value>$NUTCH_HOME/filesystem/mapreduce/local</value></property>

Edit $HADOOP_CONF_DIR/masters to replace localhost with your $MASTER_IP.
Edit $HADOOP_CONF_DIR/slaves to remove localhost and add all your $SLAVE_IP addresses.
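
For example, with a master at 10.0.0.1 and two slaves at 10.0.0.2 and 10.0.0.3 (hypothetical addresses):

echo "10.0.0.1" > $HADOOP_CONF_DIR/masters
printf "10.0.0.2\n10.0.0.3\n" > $HADOOP_CONF_DIR/slaves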

Deploy the key and the configuration on all slave nodes:

scp $HADOOP_CONF_DIR/* "$SLAVE_IP:$HADOOP_CONF_DIR/."

If you installed Nutch from source:

ssh $SLAVE_IP "mkdir -p $NUTCH_HOME/build"
scp -r $NUTCH_HOME/build/plugins "$SLAVE_IP:$NUTCH_HOME/build/plugins"

Start Hadoop

$HADOOP_PREFIX/bin/hadoop namenode -format
$HADOOP_PREFIX/bin/start-all.sh
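
To check that everything came up, jps should list NameNode, SecondaryNameNode and JobTracker on the master, and DataNode and TaskTracker on each slave:

jps
ssh $SLAVE_IP jps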

S3cmd 1.5.2

If you would like to launch scripts that interact with S3, you should install S3cmd:
https://github.com/s3tools/s3cmd

./s3cmd --configure --access_key=$MY_ACCESS_KEY --secret_key=$MY_SECRET_KEY

FileSystem

S3 looks like a good option for saving data; moreover, Hadoop can natively work with S3.

However, S3 is slower than HDFS, so we discourage you from using S3 directly.

We tried to keep the Hadoop temporary files on HDFS in order to avoid the performance loss and to put only the output on S3; unfortunately, this is not possible, at least with Nutch.

Therefore, the solution is to keep everything on HDFS and frequently run a script which transfers the results to S3.

There are three S3 filesystem implementations in Hadoop. S3a looks to be the best, but unfortunately it is not implemented in Hadoop 1.2.1, so we decided to use S3 (alias S3b).

S3 Method

https://wiki.apache.org/hadoop/AmazonS3

  • S3n: A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3.
  • S3 (alias S3b): A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem – you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
  • S3a: A successor to the S3 Native, s3n fs, the S3a: system uses Amazon’s libraries to interact with S3. This allows S3a to support larger files (no more 5GB limit), higher performance operations and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL schema.

S3 Interaction

To interact with your S3 bucket, we used the Hadoop tools and S3cmd: http://s3tools.org/download

Here is the S3cmd usage: http://s3tools.org/usage
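
For example, to list the same bucket with both tools (assuming the credentials configured above):

$HADOOP_PREFIX/bin/hadoop fs -ls "s3://$BUCKET/"
s3cmd ls "s3://$BUCKET"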

Run Nutch 1.9

If you disabled Solr in the crawl script, you can set the SOLR_URL variable to an empty string.

Standalone

URLS="urls/"
CRAWL="crawl"
MAX_ITERATION=1000
SOLR_URL="http://127.0.0.1:8983/solr/"
SOLR_HOME="$HOME/apache-solr-3.4.0"
NUTCH_HOME="$HOME/apache-nutch-1.9"

cd $SOLR_HOME/example/
java -jar start.jar > ~/solr.log 2>&1 &
cd -

$NUTCH_HOME/runtime/local/bin/crawl $URLS $CRAWL $SOLR_URL $MAX_ITERATION

Deployed mode

URLS="urls/"
CRAWL="crawl"
MAX_ITERATION=1000
SOLR_IP=`sudo ifconfig | grep 'inet addr' | grep 'Bcast' | awk '{print $2}' | sed 's/addr://g'`
SOLR_URL="http://$SOLR_IP:8983/solr/"
SOLR_HOME="$HOME/apache-solr-3.4.0"
NUTCH_HOME="$HOME/apache-nutch-1.9"
HADOOP_PREFIX="$HOME/hadoop-1.2.1"

$HADOOP_PREFIX/bin/start-all.sh

cd $SOLR_HOME/example/
java -jar start.jar > ~/solr.log 2>&1 &
cd -

$HADOOP_PREFIX/bin/hadoop dfsadmin -safemode wait
$HADOOP_PREFIX/bin/hadoop dfs -put urls/ urls
$NUTCH_HOME/runtime/deploy/bin/crawl $URLS $CRAWL $SOLR_URL $MAX_ITERATION

Stop crawling

If you would like to stop the script before it finishes, you can do it cleanly at the end of the current segment by creating a .STOP file in the directory from which the script was launched.
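
For example, from the directory where the crawl script was launched:

touch .STOP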

Restart crawling

If you would like to restart crawling, you will want to pick up where you left off rather than refetch everything from the beginning. If you stopped the script properly (with .STOP), you basically just have to skip the Nutch inject command and it will work. If you did not, there are several cases:

If your crawldb is not locked (it does not contain a .locked file), you can just skip the inject command as well.

If it is locked, you can try to remove the .locked file and, again, skip the inject command.

If you get an error with the above methods, your crawldb is lost. You then have to regenerate it by running Nutch updatedb on all finished segments and then running Nutch dedup; if you are lucky it will work, but it will take a while. Personally, I have never tried that method.
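
A sketch of that recovery in local mode, assuming the crawl directory layout used above:

# rebuild the crawldb from every finished segment
for seg in crawl/segments/*; do
  $NUTCH_HOME/runtime/local/bin/nutch updatedb crawl/crawldb "$seg"
done
# then remove duplicate URLs
$NUTCH_HOME/runtime/local/bin/nutch dedup crawl/crawldb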

Extraction

Nutch allows you to get the data back very easily by launching Nutch readseg. It works in both local and deploy mode; you just need to adapt the paths to your case.

$NUTCH_HOME/runtime/$MODE/bin/nutch readseg -dump crawl/segments/* output/ -nogenerate -noparse -noparsedata -noparsetext

You can easily adapt the results to your needs by reading the readseg usage. Basically, you can get back everything that Nutch fetches (metadata, fetch responses, parsed text, etc.).

Pass every option except -nocontent to get the HTML.

Pass every option except -noparsetext to get the text.
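
For example (hypothetical output folders; check the readseg usage for the exact option list of your version):

# raw HTML only: keep the content directory, skip everything else
$NUTCH_HOME/runtime/$MODE/bin/nutch readseg -dump crawl/segments/* html/ -nofetch -nogenerate -noparse -noparsedata -noparsetext
# parsed text only: keep parse_text, skip everything else
$NUTCH_HOME/runtime/$MODE/bin/nutch readseg -dump crawl/segments/* text/ -nocontent -nofetch -nogenerate -noparse -noparsedata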

Backup to S3

To back up crawl data to S3, you should use Hadoop's distcp to transfer your data directly from HDFS to S3:

hadoop distcp -update "hdfs://$MASTER_IP:9000/user/admin/crawl" "s3://$BUCKET/crawl"

You can copy the segments folder in parallel with Nutch, because the segment of one iteration is never used again after that iteration. But the crawldb is updated at each iteration, so you need to wait for the crawldb copy to finish before letting Nutch start the next iteration.

If you proceed as above and write to s3 (and not s3n) with Hadoop, unfortunately only Hadoop will be able to read the data in your bucket; you will not be able to use other tools such as s3cmd.

IMPORTANT

Never use the -p option of distcp with S3; it preserves, among other things, the user, group and permissions, and S3 does not manage any permissions at the file or directory level, so it just slows down the copy for nothing. Moreover, if you copy in the other direction, from S3 to HDFS, with -p, it will fail at map 100%!

Amazon

We use the AMI ami-7ffae53a, which is a Debian Wheezy PVM image.

We use spot instances in order to reduce the cost.

We use the m3.medium instance type, because the t1 and t2 types are designed for very light CPU usage.

You can use Amazon EC2 CLI to automatically request Amazon instances.
http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html#tools-introduction
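
As an illustration with the newer AWS CLI (an alternative to the EC2 CLI linked above; the price, count and key name are placeholders):

aws ec2 request-spot-instances --spot-price "0.02" --instance-count 4 --type "one-time" --launch-specification '{"ImageId":"ami-7ffae53a","InstanceType":"m3.medium","KeyName":"my-key"}'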