Friday, February 6, 2015

EMR task/step restarted every 10 minutes. How to fix in Python.

I was playing around with Amazon's MapReduce service (EMR) and noticed that it kept restarting my task every 10 minutes. After a quick search I found the reason: Hadoop kills a task that has not reported any status for 600 seconds. This can also be seen in the log file:

Task attempt .... failed to report status for 600 seconds.Killing!

The Hadoop docs describe the mapred.task.timeout setting (default 600000) like this:

The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.

So if your mapper takes a long time to emit any output, you can change the default timeout with this option:

--jobconf mapred.task.timeout=10000000

But that still didn't work for me. 10000000 milliseconds is only about 2.8 hours, and my mapper task could take days or weeks to finish its work.

So I decided to simply update the status string every X seconds from my Python script. Let's go back to the Hadoop docs again and read:

How do I update status in streaming applications?  A streaming process can use the stderr to emit status information. To set a status, reporter:status:<message> should be sent to stderr.

In Python code this can be done like this:

import sys
sys.stderr.write("reporter:status:hello_hadoop\n")  # works in Python 2 and 3
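To send the status "every X seconds" without touching the mapper's main loop, one option is a small background thread. This is only a sketch under my own assumptions: the function name start_status_heartbeat, the 60 second default and the message text are made up for illustration, and the only Hadoop-specific part is the reporter:status: prefix written to stderr.

```python
import sys
import threading
import time


def start_status_heartbeat(interval_seconds=60, message="still working", stream=None):
    """Write a reporter:status line to stderr every interval_seconds.

    Returns a threading.Event; call .set() on it to stop the heartbeat.
    """
    stream = stream if stream is not None else sys.stderr
    stop_event = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once stop_event is set,
        # so the loop ends as soon as the caller asks it to stop.
        while not stop_event.wait(interval_seconds):
            stream.write("reporter:status:" + message + "\n")
            stream.flush()

    worker = threading.Thread(target=beat)
    worker.daemon = True  # do not keep the process alive after the mapper ends
    worker.start()
    return stop_event


# Hypothetical use inside a streaming mapper:
#   stop = start_status_heartbeat(interval_seconds=300)
#   ... long-running work ...
#   stop.set()
```

The interval just has to stay well below the task timeout (600 seconds by default). Keep in mind that the heartbeat keeps reporting even if the real work hangs, so Hadoop will no longer kill a stuck task for you.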

Linux commands to extract .gz, .tar.bz2, .tar.gz files

I always forget how to decompress/extract certain file types in Linux. Below is a list of Linux commands for decompressing different archives:

Linux extract commands

gunzip filename.gz
this command deletes the original .gz file after extraction

bzip2 -d filename.bz2
this command deletes the original .bz2 file after extraction
bzip2 -dk filename.bz2
this command also extracts, but keeps the original file

tar -xvzf filename.tar.gz
  • x: tar can collect files or extract them. x does the latter.
  • v: makes tar talk a lot. Verbose output shows you all the files being extracted.
  • z: tells tar to decompress the archive using gzip
  • f: this must be the last flag in the group, with the tar file name immediately after it. It tells tar the name and path of the archive file.
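If you would rather not remember the flags at all, roughly the same thing can be done from Python's standard library. This is a sketch, not a replacement for the commands above: the helper name extract and the dest_dir parameter are my own, and unlike gunzip or bzip2 -d it always keeps the original file. It also handles .tar.bz2, because tarfile detects the compression itself.

```python
import bz2
import gzip
import os
import shutil
import tarfile


def extract(path, dest_dir="."):
    """Unpack a .tar.gz/.tar.bz2 archive, or decompress a single .gz/.bz2 file."""
    if tarfile.is_tarfile(path):
        # "r:*" lets tarfile detect gzip or bzip2 compression on its own.
        with tarfile.open(path, "r:*") as archive:
            if hasattr(tarfile, "data_filter"):
                # Newer Python versions: refuse members that escape dest_dir.
                archive.extractall(dest_dir, filter="data")
            else:
                # Older Python versions: only use this on archives you trust.
                archive.extractall(dest_dir)
        return dest_dir

    openers = {".gz": gzip.open, ".bz2": bz2.open}
    base, ext = os.path.splitext(os.path.basename(path))
    if ext not in openers:
        raise ValueError("unsupported file type: " + path)

    # Stream the decompressed bytes into a file named without the suffix.
    target = os.path.join(dest_dir, base)
    with openers[ext](path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, extract("filename.tar.gz", "out") unpacks into the existing directory out, and extract("filename.gz") writes filename next to where you run it.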

Sunday, February 1, 2015

Python: No crypto library available in Windows. How to fix.

While developing a script that uploads files to Google Drive through the Google Drive API, I ran into the following error:

oauth2client.client.CryptoUnavailableError: No crypto library available

This is how I fixed it on Ubuntu:
apt-get install python-openssl

And this is how to fix it on Windows:
pip install pyopenssl
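To check whether the install worked before running the upload script again, you can ask Python which crypto modules it can find. This is a sketch with assumptions: the function name available_crypto_backends is mine, and I am assuming that oauth2client looks for either pyOpenSSL (imported as OpenSSL) or PyCrypto (imported as Crypto), so check the oauth2client version you have if the list looks fine but the error stays.

```python
import importlib.util


def available_crypto_backends(candidates=("OpenSSL", "Crypto")):
    """Return the candidate module names that this interpreter can import."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


# Hypothetical use: an empty list means neither library is installed
# for the Python interpreter you ran this with.
#   print(available_crypto_backends())
```

If the list is empty after pip install pyopenssl, pip most likely installed the package for a different Python installation than the one running your script.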