Thursday, January 5, 2017
How to scrape https websites with proxies
Hi all,
My last post about scraping with proxies is quite old, so I decided to write a newer version. In particular, today I will focus on how to scrape https websites through proxies.
There is also good news about the requests library. For a long time requests did not support socks proxies, but a 2016 release added that capability, so requests now fully supports both http and socks proxies.
So let's get started. Below I will show you 4 different examples of how to scrape a single https page: first with requests, using https and socks proxies, and then the same with the urllib3 library.
Requirements
In order to use requests and urllib3 with proxy support, you need to install the requirements first:
pip install requests[socks]
pip install urllib3[socks,secure]
At the time of writing, I have these versions:
PySocks==1.6.5
pyOpenSSL==16.2.0
certifi==2016.9.26
requests==2.12.4
urllib3==1.19.1
1.1 requests with https proxy
import requests

url = "https://icanhazip.com/"
r = requests.get(url,
                 # verify=False,  # uncomment to skip certificate verification
                 proxies=dict(http='https://x.x.x.x:xxxx/',
                              https='https://x.x.x.x:xxxx/'))
print(r.status_code)
1.2 requests with socks proxy
import requests

url = "https://icanhazip.com/"
r = requests.get(url,
                 # verify=False,  # uncomment to skip certificate verification
                 proxies=dict(http='socks5://x.x.x.x:xxxx',
                              https='socks5://x.x.x.x:xxxx'))
print(r.status_code)
2.1 urllib3 with https proxy
import certifi
from urllib3 import ProxyManager

url = "https://icanhazip.com/"
proxy = ProxyManager(
    'https://x.x.x.x:xxxx/',
    # cert_reqs='CERT_NONE',  # uncomment (and drop the two lines below) to skip verification
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where()
)
r = proxy.request('GET', url)
print(r.status)
2.2 urllib3 with socks proxy
import certifi
from urllib3.contrib.socks import SOCKSProxyManager

url = "https://icanhazip.com/"
proxy = SOCKSProxyManager(
    'socks5://x.x.x.x:xxxx/',
    # cert_reqs='CERT_NONE',  # uncomment (and drop the two lines below) to skip verification
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where()
)
r = proxy.request('GET', url)
print(r.status)
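Since icanhazip.com simply echoes back the IP it sees, you can also verify a proxy from the client side instead of checking server logs. A minimal sketch of that idea, where make_proxies is a hypothetical helper and the proxy address is a placeholder:

```python
import requests

def make_proxies(scheme, host, port):
    # Build a requests-style proxies dict that routes both plain http
    # and https traffic through the same proxy.
    proxy_url = "%s://%s:%d" % (scheme, host, port)
    return {'http': proxy_url, 'https': proxy_url}

# icanhazip.com returns the requesting IP, so the two responses should
# differ when the proxy is actually in use (placeholder proxy address):
# direct = requests.get("https://icanhazip.com/").text.strip()
# proxied = requests.get("https://icanhazip.com/",
#                        proxies=make_proxies('socks5', 'x.x.x.x', 1080)).text.strip()
# print(direct, proxied)
```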
I tested all 4 cases using http, https and socks proxies. I checked the nginx log on the real server to make sure that the IP hitting the https website was the proxy's. The one thing I didn't try is proxies that require authentication.
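For reference, though untested in this post, requests accepts proxy credentials embedded directly in the proxy URL; user, password and the address below are placeholders:

```python
import requests

# Credentials go in the proxy URL itself; the same user:pass@ form
# also works for socks5:// URLs via PySocks (untested here).
proxies = {
    'http': 'http://user:password@x.x.x.x:xxxx/',
    'https': 'http://user:password@x.x.x.x:xxxx/',
}
# r = requests.get("https://icanhazip.com/", proxies=proxies)
```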
If you have something to add, please let me know.