A Simple Guide to Use urllib to Crawl Web Page in Python 3

Python 3 urllib is a package that helps us to open urls. It contains four parts:

urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files

urllib.request and urllib.parse are most used in python applications, In this tutorial, we will introduce how to crawl web page using python 3 urllib.

Preliminaries

# -*- coding:utf-8 -*-
import urllib.request

Set start url you want to crawl

start_url = "https://www.alexa.com/siteinfo/tutorialexample.com"

Build a http request object

We use http request object to connect web server and craw web page.

req = urllib.request.Request(start_url)

Add http request header for your request object

An Easy Guide to Get HTTP Request Header List for Beginners – Python Web Crawler Tutorial

#add request header
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8')
req.add_header('Accept-Encoding', 'gzip, deflate, br')
req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')
req.add_header('Cache-Control', 'max-age=0')
req.add_header('Referer', 'https://www.google.com/')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')

Crawl web page and get http response object

response = urllib.request.urlopen(req)

If you want to know what variables and functions in response object. you can read this tutorial.

A Simple Way to Find Python Object Variables and Functions – Python Tutorial

Check response code and get web page content

response_code = response.status
if response_code == 200:
    content = response.read().decode("utf8")
    print(content)

Then a basic web page crawler is built.