Saturday, October 24, 2015

Web scraping using socks/http proxies

While extracting data from websites, you will most likely run into some kind of rate limiting per IP address. In that case it is a good idea to use proxies. In my scraping projects I've been using both HTTP and SOCKS proxies.
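
If you have several proxies available, one simple way to spread requests across them is to cycle through the pool, so no single IP carries all the traffic. Below is a minimal sketch of this idea; the proxy addresses are placeholders, and the dict shape matches what the requests library expects (shown further down):

```python
from itertools import cycle

# Hypothetical pool of proxy addresses -- replace with your own.
PROXY_POOL = cycle([
    "http://127.0.0.1:8080",
    "http://127.0.0.1:8081",
    "http://127.0.0.1:8082",
])

def next_proxies():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(PROXY_POOL)
    return {"http": proxy, "https": proxy}
```

Each call to next_proxies() hands back the next address in round-robin order, so you can pass a fresh dict to every request.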

http proxies

In my opinion the easiest and most powerful Python tool that supports HTTP proxies is the requests library. To use an HTTP proxy you just need to do something like this:
import requests

# Map each URL scheme to the proxy that should handle it.
my_proxy = {
    "http":  "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8081",
}
r = requests.get("http://google.com", proxies=my_proxy)
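
Besides the proxies= argument, requests also honors the standard HTTP_PROXY / HTTPS_PROXY environment variables, so you can configure the proxy without touching the code. A quick way to check what will be picked up from the environment (the addresses here are placeholders):

```python
import os
import urllib.request

# Point the standard environment variables at your proxies
# (addresses are placeholders).
os.environ["HTTP_PROXY"] = "http://127.0.0.1:8080"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:8081"

# urllib (and requests, which follows the same convention)
# will pick these up automatically:
print(urllib.request.getproxies_environment())
```

This prints a dict mapping each scheme to the proxy URL taken from the environment, so you can confirm the configuration before making any requests.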

socks proxies

Using SOCKS proxies while web scraping in Python is a bit trickier. The problem is that the latest version of requests doesn't support SOCKS proxies. But I found another good solution: you can use urllib2 together with the additional "PySocks" library. So first of all you need to install PySocks:
pip install PySocks
After that you can use SOCKS proxies to crawl web pages (both HTTP and HTTPS) in your Python code. Below is a basic example:
import urllib2

import socks
from sockshandler import SocksiPyHandler

socks_host = "127.0.0.1"
socks_port = 8080

# Route every request made through this opener via the SOCKS5 proxy.
opener = urllib2.build_opener(
    SocksiPyHandler(socks.PROXY_TYPE_SOCKS5, socks_host, socks_port)
)
req = opener.open("http://google.com", timeout=10)
content = req.read()
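
Under the hood, SocksiPyHandler opens a TCP connection to the proxy and speaks the SOCKS protocol before any HTTP bytes flow. As a rough illustration of what that handshake looks like on the wire, here is a simplified sketch of the SOCKS5 client messages (per RFC 1928, no-authentication method and domain-name addressing only; this is for illustration, not something PySocks exposes):

```python
import struct

def socks5_connect_request(host, port):
    """Build the two byte sequences a SOCKS5 client sends to open a tunnel
    (simplified: no-auth method only, domain-name addressing)."""
    # Greeting: version 5, one auth method offered, 0x00 = "no authentication".
    greeting = b"\x05\x01\x00"
    # CONNECT request: version 5, command CONNECT (0x01), reserved byte,
    # address type 0x03 (domain name), then a length-prefixed hostname
    # and a big-endian 16-bit port.
    host_bytes = host.encode("idna")
    request = (b"\x05\x01\x00\x03"
               + struct.pack("!B", len(host_bytes))
               + host_bytes
               + struct.pack("!H", port))
    return greeting, request
```

After the proxy replies to these two messages, the socket behaves like a direct connection to the target host, which is why the urllib2 code above can send ordinary HTTP over it.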

P.S. If you know a better way to scrape data through proxies in Python, feel free to share it in the comments.

2 comments:

  1. Hi. Thanks for your post. I am trying to use requests to scrape an HTTPS site. It seems that when I access an HTTP site, the proxies work fine. But when I access an HTTPS site, the proxy is completely bypassed, as you can see in the log files below. Here, fr.proxymesh.com is the proxy server, and as you can see, the request goes through the proxymesh server only for the HTTP site, not for the HTTPS site. The code that I run is exactly the same for both the HTTP and HTTPS sites. Would you be able to share any ideas about how I might resolve this? Thank you.

    Log file:

    When accessing the http site:

    INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): fr.proxymesh.com
    DEBUG:requests.packages.urllib3.connectionpool:"GET http://docs.python-requests.org/en/master/user/quickstart/ HTTP/1.1" 200 None


    When accessing the https site:

    INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.haskell.org
    DEBUG:requests.packages.urllib3.connectionpool:"GET /happy/ HTTP/1.1" 200 None

    ReplyDelete
    Replies
    1. I tried to access an https website with requests and an https proxy, and I got the following log:
      DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): icanhazip.com
      DEBUG:requests.packages.urllib3.connectionpool:https://icanhazip.com:443 "GET / HTTP/1.1" 200 None

      From this log you can't see my proxy IP, but I actually visited the website through the proxy. I also tested it on a real server and checked the nginx logs.

      That said, just by looking at the Python log, you can't tell which IP hits your target page.

      So to check that your code uses proxies, you can try visiting the http/https versions of some website that returns your IP, and then compare the output. For example, this one: https://icanhazip.com/

      I also wrote a new post here about proxies: http://codeinpython.blogspot.com/2017/01/how-to-scrape-https-website-with-proxies.html

      Delete