Friday, February 6, 2015
EMR task/step restarted every 10 minutes: how to fix it in Python
I was playing around with Amazon's MapReduce service (EMR) and noticed that it kept restarting my mapper.py every 10 minutes. After a quick search I found the reason: Hadoop was killing the task after it had gone 600 seconds without reporting any status. This can also be seen in the log file:
Task attempt .... failed to report status for 600 seconds. Killing!
The Hadoop docs describe this timeout as follows:
mapred.task.timeout (default 600000): The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.
So if your mapper takes a long time to emit any output, you can raise the default timeout with this option:
--jobconf mapred.task.timeout=10000000
But this still didn't solve the problem for me, because my mapper task could take days or weeks to finish its work.
So I decided to simply update the status string every X seconds from my Python script. Going back to the Hadoop docs again:
How do I update status in streaming applications? A streaming process can use the stderr to emit status information. To set a status, reporter:status:<message> should be sent to stderr.
In Python this can be done like this:
import sys
sys.stderr.write("reporter:status:" + "hello_hadoop" + "\n")
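To send that status line every X seconds even while the mapper is busy inside one long computation, one option is a small background thread. The following is only a sketch under assumptions: the names start_status_heartbeat, run_mapper and process_line are hypothetical helpers invented for this example (they are not part of Hadoop or EMR), and the 60-second interval is an arbitrary value well below the 600-second timeout.

```python
import sys
import threading


def start_status_heartbeat(interval_seconds=60, message="still working", stream=None):
    """Write a 'reporter:status:' line to stream every interval_seconds.

    Returns a threading.Event; call .set() on it to stop the heartbeat.
    """
    stream = stream if stream is not None else sys.stderr
    stop_event = threading.Event()

    def beat():
        # Event.wait() returns False on timeout and True once the event is set,
        # so the loop emits one status line per interval until stopped.
        while not stop_event.wait(interval_seconds):
            stream.write("reporter:status:" + message + "\n")
            stream.flush()

    worker = threading.Thread(target=beat)
    worker.daemon = True  # never keep the process alive after the mapper exits
    worker.start()
    return stop_event


def process_line(line):
    # Hypothetical placeholder for the real, slow per-record work.
    return line.strip()


def run_mapper(input_stream, output_stream, interval_seconds=60):
    """Process records line by line while the heartbeat keeps the task alive."""
    stop_event = start_status_heartbeat(interval_seconds)
    try:
        for line in input_stream:
            output_stream.write(process_line(line) + "\n")
    finally:
        stop_event.set()
```

In a real mapper.py you would finish the script with a call such as run_mapper(sys.stdin, sys.stdout). A thread is used here because a check placed inside the record loop would not fire while a single record takes longer than the timeout.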