Sunday, October 25, 2015
Web scraping using socks/http proxies
While extracting data from websites, you will most likely run into some kind of rate limiting per IP address. In that case it is a good idea to use proxies. In my scraping projects I've been using both HTTP and SOCKS proxies.
http proxies
In my opinion, the easiest and most powerful Python tool that supports HTTP proxies is the requests library. To use an HTTP proxy you just need to do something like this:

import requests

my_proxy = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8081",
}
r = requests.get("http://google.com", proxies=my_proxy)
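If you make many requests, you can also set the proxies once on a requests.Session instead of passing them to every call. A minimal sketch; the localhost endpoints are placeholders, assuming you have proxies listening there:

```python
import requests

# Placeholder proxy endpoints -- replace with your real proxy addresses.
my_proxy = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8081",
}

session = requests.Session()
session.proxies.update(my_proxy)  # every request on this session now goes through the proxy

# session.get("http://google.com")  # uncomment once a proxy is actually running
```

This way cookies and the proxy settings are reused across all requests in one scraping run.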
socks proxies
Using SOCKS proxies while web scraping in Python is a bit trickier. The problem is that the latest version of requests doesn't support SOCKS proxies. But I found another good solution: you can use urllib2 together with the additional "PySocks" library. So first of all you need to install PySocks:

pip install PySocks

After that you can use SOCKS proxies to crawl web pages (both http and https) in your Python code. Below is a basic example:
import urllib2

import socks
from sockshandler import SocksiPyHandler

socks_host = "127.0.0.1"
socks_port = 8080

opener = urllib2.build_opener(
    SocksiPyHandler(socks.PROXY_TYPE_SOCKS5, socks_host, socks_port)
)
req = opener.open("http://google.com", timeout=10)
content = req.read()
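Update: newer requests releases (2.10 and later, after this post was written) added SOCKS support through the requests[socks] extra, which pulls in PySocks under the hood. A minimal sketch, with a placeholder local SOCKS5 endpoint:

```python
# pip install "requests[socks]"   (requests >= 2.10)
import requests

# Placeholder SOCKS5 endpoint -- replace with your real proxy address.
proxies = {
    "http": "socks5://127.0.0.1:8080",
    "https": "socks5://127.0.0.1:8080",
}

# requests.get("http://google.com", proxies=proxies)  # uncomment with a live proxy
```

With this, the same proxies dict style works for both HTTP and SOCKS proxies, only the URL scheme changes.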
P.S. If you know better solution how to scrape data with proxies in python, feel free to share it in comments.
Hi. Thanks for your post. I am trying to use Requests to scrape HTTPS site. It seems that when I access HTTP site, the proxies work fine. But when I access HTTPS site, the proxy is completely bypassed, as you can see in the log files below. Here, fr.proxymesh.com is the proxy server, and as you can see, the request goes through the proxymesh server only for HTTP site, not for HTTPS site. The code that I run is exactly the same for both the HTTP and HTTPS sites. Would you be able to share any ideas about how I might resolve this? Thank-you.
Log file:
When accessing the http site:
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): fr.proxymesh.com
DEBUG:requests.packages.urllib3.connectionpool:"GET http://docs.python-requests.org/en/master/user/quickstart/ HTTP/1.1" 200 None
When accessing the https site:
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.haskell.org
DEBUG:requests.packages.urllib3.connectionpool:"GET /happy/ HTTP/1.1" 200 None
I tried to access https website with requests and https proxy, and I have the following log:
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): icanhazip.com
DEBUG:requests.packages.urllib3.connectionpool:https://icanhazip.com:443 "GET / HTTP/1.1" 200 None
From this log you can't see my proxy IP, but I actually did visit the website through the proxy. I also tested it on a real server and checked the nginx logs.
In other words, from the Python log alone you can't tell which IP actually hits your target page.
So to check that your code really uses the proxy, you can visit the http and https versions of some website that returns your IP, and compare the output. For example, this one: https://icanhazip.com/
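That check can be sketched as a small helper with requests (the function name current_ip is my own; icanhazip.com simply returns the caller's public IP as one line of plain text):

```python
import requests

def current_ip(proxies=None):
    # icanhazip.com answers with the caller's public IP as plain text
    resp = requests.get("https://icanhazip.com/", proxies=proxies, timeout=10)
    resp.raise_for_status()
    return resp.text.strip()

# direct = current_ip()
# proxied = current_ip({"https": "http://127.0.0.1:8081"})  # placeholder proxy
# If the proxy is really used, the two addresses should differ:
# assert direct != proxied
```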
I also wrote a new post here about proxies: http://codeinpython.blogspot.com/2017/01/how-to-scrape-https-website-with-proxies.html