Thursday, January 5, 2017

How to scrape HTTPS websites with proxies

Hi all,

My last post about scraping with proxies is quite old, so I decided to write an updated version. This time I will focus on how to scrape HTTPS websites through proxies.

There is also good news about the requests library. For a long time requests did not support SOCKS proxies, but a release in 2016 added that support. So requests now supports both HTTP and SOCKS proxies.

So let's get started. Below I will show 4 examples of how to scrape a single HTTPS page. First, we will scrape it with requests, once through an HTTPS proxy and once through a SOCKS proxy. Second, we will do the same using the urllib3 library.

Requirements


To use requests and urllib3 with proxies, install them together with their extras first.
pip install requests[socks]
pip install urllib3[socks,secure]

At the time of writing, I have these versions:
PySocks==1.6.5
pyOpenSSL==16.2.0
certifi==2016.9.26
requests==2.12.4
urllib3==1.19.1
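
To compare your own environment with the list above, something like the following sketch can print what is installed. It assumes Python 3.8+ (for importlib.metadata), which is newer than the setup this post was written against, and the helper name installed_versions is only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the version list above.
PACKAGES = ["PySocks", "pyOpenSSL", "certifi", "requests", "urllib3"]


def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print("{}=={}".format(name, ver) if ver else "{} is not installed".format(name))
```

If your versions are much newer than mine, proxy handling may behave differently from what is described here.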

1.1 requests with https proxy

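import requests

url = "https://icanhazip.com/"

# Route both plain and TLS traffic through the same proxy.
# Replace x.x.x.x:xxxx with your proxy's address and port.
r = requests.get(url,
        #verify=False,  # uncomment to skip certificate checks (insecure)
        proxies=dict(http='https://x.x.x.x:xxxx/',
                https='https://x.x.x.x:xxxx/'
        ))

print(r.status_code)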


1.2 requests with socks proxy

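import requests

url = "https://icanhazip.com/"

# socks5:// proxies need the PySocks extra (pip install requests[socks]).
# Replace x.x.x.x:xxxx with your proxy's address and port.
r = requests.get(url,
        #verify=False,  # uncomment to skip certificate checks (insecure)
        proxies=dict(http='socks5://x.x.x.x:xxxx',
                https='socks5://x.x.x.x:xxxx'
        ))

print(r.status_code)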
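
Both requests examples pass the same proxy URL under the http and https keys. A small helper can build that mapping from its parts. This is only a sketch: the name build_proxies and the 127.0.0.1:1080 address are made up for illustration, and the optional user/password part follows the usual scheme://user:password@host:port URL form, which I have not tested against a real authenticated proxy.

```python
from urllib.parse import quote


def build_proxies(scheme, host, port, user=None, password=None):
    """Build a requests-style proxies dict that uses one proxy for http and https."""
    auth = ""
    if user is not None:
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
        auth = quote(user, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    proxy_url = "{}://{}{}:{}".format(scheme, auth, host, port)
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical local SOCKS5 proxy, used only as an example value.
print(build_proxies("socks5", "127.0.0.1", 1080))
```

The returned dict can be passed directly as requests.get(url, proxies=...).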


2.1 urllib3 with https proxy

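import certifi
from urllib3 import ProxyManager

url = "https://icanhazip.com/"

# Verify the target site's certificate against certifi's CA bundle.
# Replace x.x.x.x:xxxx with your proxy's address and port.
proxy = ProxyManager(
        'https://x.x.x.x:xxxx/',
        #cert_reqs='CERT_NONE',  # disables certificate verification (insecure)
        cert_reqs='CERT_REQUIRED',
        ca_certs=certifi.where()
        )

r = proxy.request('GET', url)

print(r.status)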

2.2 urllib3 with socks proxy

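import certifi
from urllib3.contrib.socks import SOCKSProxyManager

url = "https://icanhazip.com/"

# SOCKSProxyManager needs the PySocks extra (pip install urllib3[socks]).
# Replace x.x.x.x:xxxx with your proxy's address and port.
proxy = SOCKSProxyManager(
        'socks5://x.x.x.x:xxxx/',
        #cert_reqs='CERT_NONE',  # disables certificate verification (insecure)
        cert_reqs='CERT_REQUIRED',
        ca_certs=certifi.where()
        )

r = proxy.request('GET', url)

print(r.status)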
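
The two urllib3 examples differ only in which manager class is created. As a sketch, that choice can be made from the proxy URL's scheme. The names pick_manager_kind and make_manager are hypothetical, the list of SOCKS schemes is my assumption about what recent urllib3 versions accept, and the urllib3 and certifi imports are kept inside make_manager so the scheme check can be tried without them installed.

```python
from urllib.parse import urlsplit

# Assumed set of SOCKS schemes; check your urllib3 version if one is rejected.
SOCKS_SCHEMES = {"socks4", "socks4a", "socks5", "socks5h"}


def pick_manager_kind(proxy_url):
    """Return 'socks' or 'http' depending on the proxy URL's scheme."""
    scheme = urlsplit(proxy_url).scheme.lower()
    if scheme in SOCKS_SCHEMES:
        return "socks"
    if scheme in ("http", "https"):
        return "http"
    raise ValueError("unsupported proxy scheme: {!r}".format(scheme))


def make_manager(proxy_url):
    """Create the matching urllib3 manager with certificate verification on."""
    import certifi
    from urllib3 import ProxyManager
    from urllib3.contrib.socks import SOCKSProxyManager

    cls = SOCKSProxyManager if pick_manager_kind(proxy_url) == "socks" else ProxyManager
    return cls(proxy_url, cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())


# Example addresses are placeholders, not real proxies.
print(pick_manager_kind("socks5://127.0.0.1:1080/"))
print(pick_manager_kind("https://127.0.0.1:3128/"))
```

With a real proxy URL, make_manager(proxy_url).request('GET', url) would then work the same way for both proxy types.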

I tested all 4 cases using HTTP, HTTPS and SOCKS proxies. I checked the nginx log on the real server to make sure that the IP hitting the HTTPS website is the proxy's IP. The only thing I didn't try is proxies that require authentication.
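
The server-side check can be complemented from the client: icanhazip.com replies with the caller's IP as plain text, so the response body can be compared with the address you expect the proxy to exit from. A minimal sketch, with a hypothetical helper name and documentation-range example addresses. Keep in mind that a proxy's exit IP is not always the address you connect to.

```python
import ipaddress


def exit_ip_matches(body, expected_ip):
    """True if the IP echoed in the response body equals the expected proxy IP."""
    try:
        seen = ipaddress.ip_address(body.strip())
    except ValueError:
        # The body was not a bare IP address (e.g. an error page from the proxy).
        return False
    return seen == ipaddress.ip_address(expected_ip)


# With requests the body would be r.text; with urllib3, r.data.decode("ascii").
print(exit_ip_matches("203.0.113.7\n", "203.0.113.7"))
print(exit_ip_matches("198.51.100.4\n", "203.0.113.7"))
```

If the check returns False while the request itself succeeded, the traffic most likely did not go through the proxy you intended.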

If you have something to add, please let me know.
