Saturday, October 24, 2015

Web scraping using socks/http proxies

While extracting data from websites, you will most likely run into some kind of rate limiting per IP address. In that case it is a good idea to use proxies. In my scraping projects I've been using both HTTP and SOCKS proxies.
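
If you have several proxies available, one simple way to spread requests across them is to cycle through the pool, so no single IP carries all the traffic. Below is a minimal sketch of this idea; the proxy addresses are placeholders, and the dict shape matches what the requests library expects (shown further down):

```python
from itertools import cycle

# Hypothetical pool of proxy addresses -- replace with your own.
PROXY_POOL = cycle([
    "http://127.0.0.1:8080",
    "http://127.0.0.1:8081",
    "http://127.0.0.1:8082",
])

def next_proxies():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(PROXY_POOL)
    return {"http": proxy, "https": proxy}
```

Each call to next_proxies() hands back the next address in round-robin order, so you can pass a fresh dict to every request.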

http proxies

In my opinion the easiest and most powerful Python tool that supports HTTP proxies is the requests library. To use an HTTP proxy you just need to do something like this:
import requests

# Map each URL scheme to the proxy that should handle it.
my_proxy = {
    "http":  "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8081",
}
r = requests.get("http://google.com", proxies=my_proxy)
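
Besides the proxies= argument, requests also honors the standard HTTP_PROXY / HTTPS_PROXY environment variables, so you can configure the proxy without touching the code. A quick way to check what will be picked up from the environment (the addresses here are placeholders):

```python
import os
import urllib.request

# Point the standard environment variables at your proxies
# (addresses are placeholders).
os.environ["HTTP_PROXY"] = "http://127.0.0.1:8080"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:8081"

# urllib (and requests, which follows the same convention)
# will pick these up automatically:
print(urllib.request.getproxies_environment())
```

This prints a dict mapping each scheme to the proxy URL taken from the environment, so you can confirm the configuration before making any requests.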

socks proxies

Using SOCKS proxies while web scraping in Python is a bit trickier. The problem is that the latest version of requests doesn't support SOCKS proxies. But I found another good solution: you can use urllib2 together with the additional "PySocks" library. So first of all you need to install PySocks:
pip install PySocks
After that you can use SOCKS proxies to crawl web pages (both HTTP and HTTPS) in your Python code. Below is a basic example:
import urllib2

import socks
from sockshandler import SocksiPyHandler

socks_host = "127.0.0.1"
socks_port = 8080

# Route every request made through this opener via the SOCKS5 proxy.
opener = urllib2.build_opener(
    SocksiPyHandler(socks.PROXY_TYPE_SOCKS5, socks_host, socks_port)
)
req = opener.open("http://google.com", timeout=10)
content = req.read()
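
Under the hood, SocksiPyHandler opens a TCP connection to the proxy and speaks the SOCKS protocol before any HTTP bytes flow. As a rough illustration of what that handshake looks like on the wire, here is a simplified sketch of the SOCKS5 client messages (per RFC 1928, no-authentication method and domain-name addressing only; this is for illustration, not something PySocks exposes):

```python
import struct

def socks5_connect_request(host, port):
    """Build the two byte sequences a SOCKS5 client sends to open a tunnel
    (simplified: no-auth method only, domain-name addressing)."""
    # Greeting: version 5, one auth method offered, 0x00 = "no authentication".
    greeting = b"\x05\x01\x00"
    # CONNECT request: version 5, command CONNECT (0x01), reserved byte,
    # address type 0x03 (domain name), then a length-prefixed hostname
    # and a big-endian 16-bit port.
    host_bytes = host.encode("idna")
    request = (b"\x05\x01\x00\x03"
               + struct.pack("!B", len(host_bytes))
               + host_bytes
               + struct.pack("!H", port))
    return greeting, request
```

After the proxy replies to these two messages, the socket behaves like a direct connection to the target host, which is why the urllib2 code above can send ordinary HTTP over it.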

P.S. If you know a better way to scrape data through proxies in Python, feel free to share it in the comments.

2 comments:

  1. Hi. Thanks for your post. I am trying to use requests to scrape an HTTPS site. It seems that when I access an HTTP site, the proxies work fine. But when I access an HTTPS site, the proxy is completely bypassed, as you can see in the log files below. Here, fr.proxymesh.com is the proxy server, and as you can see, the request goes through the proxymesh server only for the HTTP site, not for the HTTPS site. The code that I run is exactly the same for both the HTTP and HTTPS sites. Would you be able to share any ideas about how I might resolve this? Thank you.

    Log file:

    When accessing the http site:

    INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): fr.proxymesh.com
    DEBUG:requests.packages.urllib3.connectionpool:"GET http://docs.python-requests.org/en/master/user/quickstart/ HTTP/1.1" 200 None


    When accessing the https site:

    INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.haskell.org
    DEBUG:requests.packages.urllib3.connectionpool:"GET /happy/ HTTP/1.1" 200 None

    ReplyDelete
    Replies
    1. I tried to access an https website with requests and an https proxy, and I got the following log:
      DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): icanhazip.com
      DEBUG:requests.packages.urllib3.connectionpool:https://icanhazip.com:443 "GET / HTTP/1.1" 200 None

      From this log you can't see my proxy IP, but I actually visited the website through the proxy. I also tested it on a real server and checked the nginx logs.

      That said, just by looking at the Python log, you can't tell which IP hits your target page.

      So to check that your code uses proxies, you can try visiting the http/https versions of some website that returns your IP, and then compare the output. For example, this one: https://icanhazip.com/

      I also wrote a new post here about proxies: http://codeinpython.blogspot.com/2017/01/how-to-scrape-https-website-with-proxies.html

      Delete