Thursday, February 5, 2015

EMR task/step restarted every 10 minutes: how to fix it in Python

I was playing around with Amazon's MapReduce service (EMR) and noticed that it kept restarting my mapper.py every 10 minutes. A quick search turned up the reason: Hadoop kills a task that has not reported any status for 600 seconds. This also shows up in the log file:

Task attempt .... failed to report status for 600 seconds. Killing!

The Hadoop docs describe the mapred.task.timeout property (default 600000) like this:

The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.

So if your mapper takes a long time to emit any output, you can raise the default timeout with this option:

--jobconf mapred.task.timeout=10000000

But a larger fixed timeout still didn't work for me, because my mapper task could take days or weeks to finish its work.

So I decided to simply update the status string every X seconds from my Python script. Back to the Hadoop docs again:

How do I update status in streaming applications? A streaming process can use the stderr to emit status information. To set a status, reporter:status:<message> should be sent to stderr.


In Python this can be done like this (written with sys.stderr.write so it runs on both Python 2 and Python 3):

import sys
sys.stderr.write("reporter:status:" + "hello_hadoop" + "\n")
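To actually send the status every X seconds while the mapper is busy, one option is a small background thread. The following is only a minimal sketch of that idea, not the exact code I ran: the names format_status and start_heartbeat, the default message and the optional stream argument are all made up for illustration.

```python
import sys
import threading


def format_status(message):
    """Build the stderr line that Hadoop streaming reads as a status update."""
    return "reporter:status:" + message + "\n"


def start_heartbeat(interval_seconds, message="still working", stream=None):
    """Write a status line every interval_seconds until stopped.

    Hypothetical helper: returns a threading.Event; call .set() on it
    to stop the heartbeat.
    """
    stream = stream if stream is not None else sys.stderr
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            stream.write(format_status(message))
            stream.flush()

    worker = threading.Thread(target=beat)
    worker.daemon = True  # do not keep the process alive after main exits
    worker.start()
    return stop
```

In a mapper you would call stop = start_heartbeat(60) before the long-running work and stop.set() once it is done (for example in a finally block). Keep in mind that with a heartbeat running in its own thread, Hadoop will no longer kill a mapper that is genuinely hung, so it may be safer to emit the status from inside the work loop when that is possible.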
