Best Practice to Python OpenDirector Add HTTP Request Header – Python Web Crawler Tutorial

By | August 10, 2019

To crawl a web page, we should add some http request headers to our crawler to simulate browser. We can use urllib.request.Request() to build a request object to add some headers to do it.

A Simple Guide to Use urllib to Crawl Web Page in Python 3 – Python Web Crawler Tutorial

Meanwhile, we also can use urllib.request.build_opener() to create an OpenDirector object to crawl a web page. In this tutorial, we will add some request headers to OpenDirector object to simulate browser.

python OpenDirector http request header

Import library

import urllib.request
import ssl

Create an OpenDirector object ignoring ssl

    context=ssl._create_unverified_context()  
    sslHandler = urllib.request.HTTPSHandler(context=context)

    opener = urllib.request.build_opener(sslHandler)

Add http request header to opener

    headers = []
    headers.append(('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'))
    headers.append(('Accept-Encoding', 'gzip, deflate, br'))
    headers.append(('Accept-Language', 'zh-CN,zh;q=0.9'))
    headers.append(('Cache-Control', 'max-age=0'))
    headers.append(('Referer', 'https://www.google.com/'))
    headers.append(('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'))

    opener.addheaders = headers

Then we build a function to create this object.

def getRequestOpener():
        
    context=ssl._create_unverified_context()  
    sslHandler = urllib.request.HTTPSHandler(context=context)

    opener = urllib.request.build_opener(sslHandler)

    headers = []
    headers.append(('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'))
    headers.append(('Accept-Encoding', 'gzip, deflate, br'))
    headers.append(('Accept-Language', 'zh-CN,zh;q=0.9'))
    headers.append(('Cache-Control', 'max-age=0'))
    headers.append(('Referer', 'https://www.google.com/'))
    headers.append(('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'))

    opener.addheaders = headers

    return opener