Python Detect Web Page Content Charset Type – Python Web Crawler Tutorial

By | July 23, 2019

To crawl web page content correctly, you need to know the charset (character encoding) of the content. Web pages use many different charsets, such as utf-8, gbk and gb2312. In this tutorial, we will introduce a way to detect the charset of a page's content using Python.


The importance of detecting content string charset type

If you do not determine the charset, you may:

1. Fail to convert a byte string to a string.

Fix UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x8b in position 0 – Python Tutorial

2. Fail to save a string to a file.

Fix Python File Write Error: UnicodeEncodeError: ‘gbk’ codec can’t encode character – Python Tutorial
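Both failures are easy to reproduce without crawling anything. The sketch below uses a made-up sample string and assumes a page that was encoded as gbk; it is only meant to illustrate the two errors listed above.

```python
# Illustrative sample text; assume the page bytes were encoded as gbk.
text = '中文网页'
raw = text.encode('gbk')

# 1. Decoding bytes with the wrong charset raises UnicodeDecodeError.
try:
    raw.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# 2. Encoding a string into a charset that lacks one of its characters
#    raises UnicodeEncodeError (this is what a file write with the wrong
#    encoding runs into). gbk has no emoji, so this sample fails.
try:
    '\U0001F600'.encode('gbk')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# With the right charset, the round trip works.
decoded = raw.decode('gbk')
print(decode_failed, encode_failed, decoded == text)
```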

How to detect the charset of a web page

One of the most basic methods is to extract it from the web page source code.

<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta data-rh="true" charset="utf-8"/>

Here the HTML meta tag contains the charset value of the page.
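As a rough sketch of this approach, a regular expression can pull the charset out of either form of the meta tag. The helper name charset_from_meta and the scan_bytes limit are assumptions made for this example. A real HTML parser would be more robust than a regular expression.

```python
import re

# Matches charset=... inside a <meta> tag, covering both forms shown above.
META_CHARSET_RE = re.compile(
    rb'''<meta[^>]+charset\s*=\s*["']?\s*([\w-]+)''', re.IGNORECASE)

def charset_from_meta(content, scan_bytes=4096):
    """Return the charset named in a <meta> tag, or None if there is none.

    content is the raw page bytes. Only the first scan_bytes are searched,
    on the assumption that meta tags appear near the top of the document.
    """
    match = META_CHARSET_RE.search(content[:scan_bytes])
    if match:
        return match.group(1).decode('ascii').lower()
    return None

page_a = b'<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
page_b = b'<meta data-rh="true" charset="utf-8"/>'
print(charset_from_meta(page_a), charset_from_meta(page_b))
```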

In this tutorial, we will use the HTTP response object and the Python chardet library to detect the charset.

Preliminaries

Get an HTTP response object: crawl_response

To get this object, you can read this article.

A Simple Guide to Use urllib to Crawl Web Page in Python 3 – Python Web Crawler Tutorial

Get the HTTP response message (the response headers)

message = crawl_response.info()

Get content charset

charset = message.get_content_charset(None)
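To see what get_content_charset() does without making a request, the sketch below builds a header message by hand. With urllib, crawl_response.info() returns an http.client.HTTPMessage, which is a subclass of email.message.Message, so the same method is available there. The helper name header_charset is an assumption made for this example.

```python
from email.message import Message

def header_charset(content_type):
    """Return the charset declared in a Content-Type header value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    # The charset parameter is returned in lower case; None if it is absent.
    return msg.get_content_charset(None)

with_charset = header_charset('text/html; charset=GBK')
without_charset = header_charset('text/html')
print(with_charset, without_charset)
```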

However, this method returns None when the Content-Type header does not declare a charset, so we should keep trying.

if not charset:
    # get_charsets() returns a list that may be empty or contain None
    charsets = message.get_charsets(None)
    charset = charsets[0] if charsets else None

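However, message.get_charsets() may also find nothing, because, like get_content_charset(), it only reads the response headers and not the meta charset tag in the HTML page. In this situation, we will use the chardet library to detect the charset from the content bytes.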

if not charset:
    import chardet
    # content is the raw bytes of the page, e.g. content = crawl_response.read()
    result = chardet.detect(content)
    charset = result['encoding']
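chardet is a third-party package (pip install chardet), so a crawler may want to degrade gracefully when it is missing. The following is a minimal sketch under that assumption; the helper name guess_charset is hypothetical, and the sample bytes are made up.

```python
try:
    import chardet  # third-party, optional in this sketch
except ImportError:
    chardet = None

def guess_charset(content):
    """Guess the charset of raw bytes with chardet.

    Returns None when chardet is not installed, when content is empty,
    or when chardet cannot make a guess.
    """
    if chardet is None or not content:
        return None
    result = chardet.detect(content)  # a dict with an 'encoding' key
    return result.get('encoding')

sample = '这是一个用来测试编码检测的中文网页内容。'.encode('gbk')
guess = guess_charset(sample)
print(guess)
```

The guess is a best effort, so it should be treated as a hint rather than a guarantee, especially for short inputs.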

The chardet library guesses the most probable charset from the content bytes. However, it has two problems:

1. If the HTML page is gbk, it may return gb2312, which means its result can differ from the value returned by message.get_content_charset(None).

2. It may also return None.

So we should set the default charset value to utf-8.

if not charset: # default to utf-8
    charset = 'utf-8'
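Both problems can be handled in one small helper. The sketch below is one possible approach, not part of chardet: it falls back to a default when the name is missing or unknown to Python, and it maps gb2312 to gbk, since gbk is a superset of gb2312 and so decodes everything gb2312 can. The names normalize_charset and SUPERSETS are assumptions made for this example.

```python
import codecs

# A detector may report a subset charset; decode with the superset instead.
SUPERSETS = {'gb2312': 'gbk'}

def normalize_charset(name, default='utf-8'):
    """Return a usable codec name, or default if name is empty or unknown."""
    if not name:
        return default
    name = name.strip().lower()
    name = SUPERSETS.get(name, name)
    try:
        codecs.lookup(name)  # raises LookupError for unknown codec names
    except LookupError:
        return default
    return name

print(normalize_charset(None))        # falls back to the default
print(normalize_charset('GB2312'))    # mapped to its superset
print(normalize_charset('utf-8'))     # known names pass through
```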

The full Python detection code is here.

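import chardet

def detectCharset(crawl_response, content):
    # content: raw bytes of the page body, e.g. crawl_response.read()
    message = crawl_response.info()
    # 1. charset declared in the Content-Type response header
    charset = message.get_content_charset(None)
    if not charset:
        # 2. get_charsets() returns a list that may be empty or contain None
        charsets = message.get_charsets(None)
        charset = charsets[0] if charsets else None
    if not charset:
        # 3. guess from the content bytes
        result = chardet.detect(content)
        charset = result['encoding']
    if not charset: # default to utf-8
        charset = 'utf-8'
    return charset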
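Once a charset has been chosen, the content bytes still have to be decoded, and a detected charset can be wrong. A minimal sketch of a tolerant decode step follows; the helper name decode_content and the choice of utf-8 with replacement characters as the fallback are assumptions made for this example.

```python
def decode_content(content, charset, fallback='utf-8'):
    """Decode raw page bytes with charset, falling back if that fails.

    LookupError covers charset names Python does not know;
    UnicodeDecodeError covers bytes that do not fit the charset.
    """
    try:
        return content.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Undecodable bytes become U+FFFD instead of raising.
        return content.decode(fallback, errors='replace')

page_bytes = '中文网页'.encode('gbk')
print(decode_content(page_bytes, 'gbk'))
```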

 
