Some of my readers might ask why a Python developer would need to do anything in Java. I never thought I would need to code in Java either. But as I described in my last post about implementing SSO with a Django website, one of its main components is a Java-based application, which I wanted to customize.
So here is the problem: I needed to verify password hashes in a Java application (Java 7), but those hashes were generated with the Python passlib library (pbkdf2_sha512). At first I even tried to implement password verification in Java, but then I gave up and decided to do it the easy way.
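Since the excerpt ends before the solution, here is a minimal, stdlib-only sketch of how passlib's pbkdf2_sha512 format is laid out and how verification works in principle. This is an illustration, not the author's actual fix, and passlib itself supports more options; the function names are my own:

```python
import base64
import hashlib
import hmac
import os

def ab64_encode(data):
    # passlib's "adapted base64": standard base64 with '+' -> '.' and no '=' padding
    return base64.b64encode(data).decode("ascii").rstrip("=").replace("+", ".")

def ab64_decode(text):
    text = text.replace(".", "+")
    return base64.b64decode(text + "=" * (-len(text) % 4))

def make_hash(password, rounds=25000, salt=None):
    # passlib layout: $pbkdf2-sha512$<rounds>$<salt>$<checksum>
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha512", password.encode("utf-8"), salt, rounds)
    return "$pbkdf2-sha512$%d$%s$%s" % (rounds, ab64_encode(salt), ab64_encode(digest))

def verify(password, hash_string):
    _, scheme, rounds, salt, checksum = hash_string.split("$")
    if scheme != "pbkdf2-sha512":
        raise ValueError("unexpected scheme: " + scheme)
    digest = hashlib.pbkdf2_hmac("sha512", password.encode("utf-8"),
                                 ab64_decode(salt), int(rounds))
    # constant-time comparison of the recomputed checksum
    return hmac.compare_digest(ab64_encode(digest), checksum)
```

Porting exactly this parsing and PBKDF2 call to Java 7's `SecretKeyFactory` is the hard part the post alludes to.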
Sunday, November 8, 2015
Friday, November 6, 2015
How to setup Shibboleth Identity Provider 3 with Django Website
Recently I was given the task of setting up Shibboleth SSO to integrate several Python websites under single sign-on. There is official Shibboleth documentation, but it is quite difficult to understand, especially for beginners. After spending days googling and reading various forums, I decided to write a step-by-step beginner's guide to installing Shibboleth SSO with a Django website.
Sunday, October 25, 2015
Web scraping using socks/http proxies
While extracting data from websites, you will most probably run into some kind of rate limiting per IP address. In that case it is good to use proxies. In my scraping projects I have used both HTTP and SOCKS proxies.
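As a small illustration of the idea, here is how an HTTP proxy can be wired in with Python's standard library alone (the proxy address is a placeholder; SOCKS proxies additionally need a third-party library such as PySocks):

```python
import urllib.request

# placeholder address - substitute a proxy you actually control
PROXY = "http://127.0.0.1:8080"

# route both http and https traffic through the proxy
proxy_handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(proxy_handler)

# opener.open("http://example.com") would now go through PROXY;
# rotating the proxy between requests helps avoid per-IP limits
```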
Wednesday, October 21, 2015
Using django models in standalone application
One of my favorite things about Django is its loosely coupled design. It means that you can use each part independently without affecting the others. And one of the parts I often use in my applications is Django models.
There is nothing difficult about using Django models in a standalone application. You just need to start a new Django project and then remove all the unnecessary stuff, leaving only the models. Below I will show it with an example.
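The usual shape of such a script is a sketch like the following; `myapp` and the `Product` model are made-up names, and the key point is that settings must be configured before any model import:

```python
# standalone.py - run outside manage.py (module and model names are hypothetical)
import django
from django.conf import settings

settings.configure(
    DATABASES={"default": {"ENGINE": "django.db.backends.sqlite3",
                           "NAME": "standalone.sqlite3"}},
    INSTALLED_APPS=["myapp"],
)
django.setup()  # required since Django 1.7

from myapp.models import Product  # safe to import only after setup()
print(Product.objects.count())
```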
Saturday, October 17, 2015
Wanna quit your job and become an Upwork freelancer? Do not do that!
Almost every day I receive job offers on my 5-star oDesk (Upwork) profile. I do not search for anything manually; clients contact me directly. But most of these job offers are really garbage.
A real example: a couple of days ago some R.D. from the United States contacted me. She wanted a website scraped.
After a quick look at the website, the job seemed straightforward: a huge list of categories/subcategories and product listings. Nothing unusual.
The next step is to set the price for the job. Usually I would say: OK, the data scrape will cost you $75. Then the client approves or declines. But this time I decided to let my client decide how much she would like to pay. So I simply said:
"Just send me an offer with price that is ok for you."
Then she asked whether I could also deliver the script. I said that I could, but it would cost more, since I would need to clean up the code and make it user-friendly and easy to run. And what was the result? How much do you think she valued my work? Below is her answer:
"Since this site is so big I was thinking having the script itself might make sense rather than reaching out each time for every random vertical - otherwise I'd have to ping you again. We have a dev team here and I studied CS, so I don't need it to be beautiful code, just functional. 😄
But for right now, let's start with the data for the food one. I had been paying $3/hr for manual data entry. Would $15 be fair to you for this category?"
Great. She had been paying $3/hr to a person without any programming skills for manual copy-paste, and she values my skills the same: $3/hr. My profile even states $15 per hour. Or does she perhaps think it is possible to develop a script, scrape the data and deliver functional code in just one hour, for $15? No further discussion.
Where are all these people from, who want me to work for $1-$3/hr? Guys, this is not serious and not funny. I'm saying NO to free labor!
P.S. I'm not saying that all clients on oDesk are like the one described above. But this is a very common example of what I've experienced so far.
Note to the client: if the task is so easy, and you even studied CS and have a dev team, why waste your time searching for, interviewing and hiring a programmer on oDesk? Wouldn't it be faster to write such an "easy" script yourself? It shouldn't take more than 5 minutes for a professional like you.
Friday, October 2, 2015
How to setup nginx+uwsgi with CKAN
Recently I had to deploy a CKAN website on a web server. There is nothing difficult about it if you just follow the official documentation and deploy CKAN with Apache. But my choice was uwsgi+nginx, because I always use this bundle.
After setting everything up the way I usually do with Django, I got the error below in the nginx log files for all static files:
[error] 1057#0: *70 upstream prematurely closed connection while reading response header from upstream
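The excerpt stops at the error. As a hedged aside, when nginx fronts uwsgi one common approach is to have nginx serve the static files directly from disk instead of passing those requests upstream; the location prefix and filesystem path below are made-up examples, not necessarily CKAN's actual layout:

```nginx
location /static/ {
    # serve static assets straight from nginx; path is hypothetical
    alias /usr/lib/ckan/default/src/ckan/ckan/public/;
}
```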
Monday, August 17, 2015
s3 upload large files to amazon using boto
Recently I had to upload large files (more than 10 GB) to Amazon S3 using boto. But when I tried the standard upload function set_contents_from_filename, it kept returning: ERROR 104 Connection reset by peer.
After a quick search I figured out that Amazon does not allow direct upload of files larger than 5 GB. To upload a file greater than 5 GB we must use multipart upload, i.e. divide the large file into smaller pieces and upload each piece separately. But I didn't want to cut my file physically, because I didn't have much disk space. Luckily there is a great solution: we can use a file pointer and set the number of bytes we want to upload at a time. Below is my function that you can use to upload large files to Amazon:
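The function itself is not included in this excerpt; below is a sketch of the approach described, using boto2's multipart API. The function name, bucket/key arguments and part size are placeholders, and it needs real AWS credentials to run:

```python
import math
import os
from boto.s3.connection import S3Connection

def upload_large_file(bucket_name, key_name, path, part_size=100 * 1024 * 1024):
    # One open file pointer is fed to upload_part_from_file in fixed-size
    # chunks, so the file is never physically split on disk.
    conn = S3Connection()  # reads AWS credentials from the environment
    bucket = conn.get_bucket(bucket_name)
    mp = bucket.initiate_multipart_upload(key_name)
    total = os.path.getsize(path)
    parts = int(math.ceil(total / float(part_size)))
    try:
        with open(path, "rb") as fp:
            for i in range(parts):
                size = min(part_size, total - i * part_size)
                mp.upload_part_from_file(fp, part_num=i + 1, size=size)
        mp.complete_upload()
    except Exception:
        mp.cancel_upload()
        raise
```

Note that S3 requires every part except the last to be at least 5 MB.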
Wednesday, August 12, 2015
How to disable Authentication/Password require on using Modem Manager (Debian Jessie)
I'm using Modem Manager on Debian Jessie, and every time I launch Modem Manager it asks me for the admin password:
It is very annoying to enter a password for every message sent or received. After a quick search I found a very simple solution. Below are detailed instructions.
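The instructions themselves are not in this excerpt; as a hedged guess at the shape of such a fix, polkit prompts on Debian Jessie are usually relaxed with a local `.pkla` file. The action and identity values below are assumptions to check against the polkit actions on your own system:

```ini
; /etc/polkit-1/localauthority/50-local.d/modemmanager.pkla (hypothetical)
[Allow ModemManager control without a password]
Identity=unix-user:*
Action=org.freedesktop.ModemManager1.*
ResultAny=yes
ResultInactive=yes
ResultActive=yes
```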
Tuesday, July 21, 2015
How to remove MATE from Debian Jessie
I installed the MATE environment alongside xfce4 on Debian Jessie because I wanted to try it out, but I didn't like it at all. So, obviously, I then wanted to remove the MATE desktop.
At first I found a solution on this website, which recommended using this long command to remove MATE:
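That site's command is not reproduced in this excerpt. As a generic sketch (not the command from that site; package names as found in Debian's repositories), removal usually looks like:

```shell
# remove the MATE metapackages, then sweep up now-unneeded dependencies
sudo apt-get remove --purge mate-desktop-environment mate-desktop-environment-extras
sudo apt-get autoremove --purge
```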
Friday, February 6, 2015
EMR task/step restarted every 10 minutes. How to fix in Python.
I was playing around with Amazon's MapReduce service (EMR) and noticed that it kept restarting my mapper.py every 10 minutes. After a quick search I found the reason: Hadoop was killing the task after a 600-second timeout, which can also be seen in the log file:
Task attempt .... failed to report status for 600 seconds. Killing!
Also, in the Hadoop docs you can find the following about mapred.task.timeout:
The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout. The default is 600000.
So if your mapper takes a long time to emit any output, you can change the default timeout using the option:
--jobconf mapred.task.timeout=10000000
But that still didn't work for me, because my mapper task could take days or even weeks to finish its work.
So I decided to simply update the status string every X seconds from my Python script. Let's go back to the Hadoop docs again and read:
How do I update status in streaming applications? A streaming process can use the stderr to emit status information. To set a status, reporter:status:<message> should be sent to stderr.
In python code this can be done like this:
import sys
print >> sys.stderr, "reporter:status:" + "hello_hadoop"
linux command to extract .gz, .tar.bz2, .tar.gz files
I always forget how to decompress/extract certain file types in Linux. Below is a list of Linux commands to decompress different archives:
Linux extract commands
.gz:
gunzip filename.gz
this command will delete the original file after extraction
.bz2:
bzip2 -d filename.bz2
this command will delete the original file after extraction
bzip2 -dk filename.bz2
this command also extracts but keeps the original
.tar.gz
tar -xvzf filename.tar.gz
- x: tar can collect files or extract them. x does the latter.
- v: makes tar talk a lot. Verbose output shows you all the files being extracted.
- z: tells tar to decompress the archive using gzip
- f: this must be the last flag of the command, and the tar file must be immediately after. It tells tar the name and path of the compressed file.
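The commands above can be exercised end-to-end in a scratch directory; this quick roundtrip confirms what each tool does to the original file (file names are arbitrary):

```shell
set -e
cd "$(mktemp -d)"
echo "some data" > notes.txt

gzip notes.txt             # produces notes.txt.gz and deletes notes.txt
gunzip notes.txt.gz        # restores notes.txt and deletes notes.txt.gz

tar -czf backup.tar.gz notes.txt
rm notes.txt
tar -xvzf backup.tar.gz    # extracts notes.txt again
```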
Sunday, February 1, 2015
Python: No crypto library available in Windows. How To Fix.
While developing a script using the Google Drive API to upload files to Google Drive, I faced the following error:
oauth2client\crypt.py
oauth2client.client.CryptoUnavailableError: No crypto library available
This is how I fixed it in ubuntu:
apt-get install python-openssl
And this is how to fix in Windows:
pip install pyopenssl
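After installing, a quick way to check that a backend is now importable is the same kind of import probe oauth2client performs internally (the flag name here is just for this snippet):

```python
# probe for the OpenSSL binding that "pip install pyopenssl" provides
try:
    import OpenSSL  # noqa: F401
    crypto_available = True
except ImportError:
    crypto_available = False

print("crypto library available:", crypto_available)
```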