Sunday, November 27, 2016

Why is Python not good for multi-threading?

Recently I was asked this question during a screening interview at Yandex (the Russian search engine), and they screened me out. They said: you're a cool guy, but try again in a year. You're OK for a junior position, but not for a senior one.

A long time ago I read on some blog that multi-threading is not a good idea in Python. That was the only thing that came to my mind at the interview. So I only answered that it's not a good idea because it requires a lot of memory. Quite a silly answer.

Then the interviewer said that it's somehow related to the GIL. What's the GIL??? It sounded like a familiar and clever word to me.

After that, I googled this blog, which explained to me why Python is not good for multi-threading. In short, all the problems come from the GIL - Global Interpreter Lock. As a result, the standard interpreter (CPython) can execute Python bytecode in only one thread at a time. If you start many threads, all of them will compete for a single lock (the GIL). Just remember that: threads cannot run Python code simultaneously on several cores. That's one of Python's disadvantages and a popular interview question.
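To see the effect for yourself, here is a minimal sketch (not taken from that blog). It assumes a standard CPython build with the GIL enabled; the function names and the loop size are made up for illustration. It runs the same CPU-bound loop twice in a row and then in two threads, and prints both timings so you can compare them on your own machine.

```python
import threading
import time


def count_down(n):
    # Pure-Python loop: CPU-bound work that needs the GIL for every bytecode step.
    while n > 0:
        n -= 1


def run_sequential(n, repeats=2):
    # Run the loop `repeats` times, one after another, in the main thread.
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, num_threads=2):
    # Run the same loop once in each of `num_threads` threads.
    threads = [threading.Thread(target=count_down, args=(n,))
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2000000  # arbitrary size, chosen only to make the timing visible
    print("sequential: %.2fs" % run_sequential(n))
    print("threaded:   %.2fs" % run_threaded(n))
```

With the GIL in place I would not expect the threaded run to be noticeably faster, since only one thread can execute the loop at any moment. The exact numbers depend on the machine and the Python version.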

OK. This question and the interview were a good learning experience.

But after a bit of thinking I realized that a year ago I had already tried scraping a website in parallel using Python threads, and it worked really well. Check this tool. It creates a queue of URLs, then starts a pool of threads. Each thread gets one URL from the queue and then retrieves data from the website using requests.get or urllib2.urlopen. But why is it fast? And why is the GIL not a problem here?

The reason this multi-threading technique works is that during network IO a thread releases the GIL, so another thread can acquire it and continue its work while the first one is waiting for data over the network. Now I know it's some kind of hack and that it's better to use a proper tool for that: choices include tornado, celery and more.
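The queue-plus-thread-pool approach described above can be sketched roughly like this. It is not the code of the original tool: fake_fetch is a hypothetical stand-in for requests.get that just sleeps (time.sleep releases the GIL, as a blocking network call does), and the URLs, delay and thread count are invented for the example.

```python
import queue
import threading
import time


def fake_fetch(url, delay=0.2):
    # Hypothetical stand-in for requests.get(url). time.sleep releases the GIL
    # while waiting, which is what a blocking network read does too.
    time.sleep(delay)
    return "response for " + url


def worker(url_queue, results, lock):
    # Each thread keeps taking URLs from the shared queue until it is empty.
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        body = fake_fetch(url)
        with lock:
            results[url] = body


def scrape(urls, num_threads=4):
    # Fill the queue, start the pool of threads, wait for all of them.
    url_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    results = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(url_queue, results, lock))
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Made-up URLs; nothing is actually downloaded.
    urls = ["http://example.com/page/%d" % i for i in range(8)]
    results, elapsed = scrape(urls, num_threads=8)
    print("fetched %d pages in %.2fs" % (len(results), elapsed))
```

Fetching the eight fake pages one after another would take about 8 x 0.2 seconds. With eight threads the waits overlap, so the total should be close to a single delay.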
