Python Parse XML Sitemap to Extract Urls: A Simple Guide – Python Tutorial

March 26, 2020

If you plan to create a Python website spider, you have to extract URLs from page content or from an XML sitemap. In this tutorial, we will introduce how to extract these URLs for your website spider.

1. Extract URLs from page content

Page content is a string, so we can extract URLs from it directly. Here is a tutorial.

A Simple Guide to Extract URLs From Python String – Python Regular Expression Tutorial

2. Extract URLs from XML sitemap

We often use an XML sitemap file to manage our website URLs, and it is a good way to submit our website links to Google Webmaster Tools. To crawl these URLs, we can parse the XML sitemap file and read the URLs from it.

An XML sitemap file may look like this:

(Image: sitemap XML file example)
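As a text stand-in for the example, here is a minimal sketch of what such a file usually contains, based on the sitemaps.org protocol. The example.com URLs and dates are placeholders, not taken from a real site, and the variable names are only illustrative. The sample is held in a Python string so it can be parsed directly.

```python
from xml.dom.minidom import parseString

# Hypothetical sitemap content; the example.com URLs are placeholders.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/post-1/</loc>
    <lastmod>2020-03-26</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/post-2/</loc>
    <lastmod>2020-03-25</lastmod>
  </url>
</urlset>
"""

# Parse the string and count the <url> entries it contains.
sample_dom = parseString(SAMPLE_SITEMAP)
sample_root = sample_dom.documentElement
url_count = len(sample_root.getElementsByTagName("url"))
print(sample_root.nodeName, url_count)
```

Each url element holds one page, and the page address sits inside its loc element.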

To parse it, we can follow the steps below.

Import xml parser library

We use the xml.dom.minidom package from the Python standard library to parse the XML sitemap file.

import xml.dom.minidom

Load xml sitemap file

We use xml.dom.minidom.parse() to open the XML file and build a DOM tree that we can parse.

xml_file = r'sitemap/post.xml'

DOMTree = xml.dom.minidom.parse(xml_file)
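A spider often has the sitemap as downloaded text rather than as a local file. The sketch below shows three ways to load the same document: parse() with a file path, parse() with an open file object, and parseString() with a string. The temporary file and the one-URL sample document are assumptions made only to keep the example self-contained.

```python
import os
import tempfile
import xml.dom.minidom

# Hypothetical one-entry sitemap used only for this demonstration.
xml_text = "<urlset><url><loc>https://www.example.com/</loc></url></urlset>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "post.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(xml_text)

    # 1. parse() accepts a file path.
    tree_from_path = xml.dom.minidom.parse(path)

    # 2. parse() also accepts an open file object.
    with open(path, "rb") as f:
        tree_from_file = xml.dom.minidom.parse(f)

# 3. parseString() works on text that is already in memory.
tree_from_string = xml.dom.minidom.parseString(xml_text)

for tree in (tree_from_path, tree_from_file, tree_from_string):
    print(tree.documentElement.nodeName)
```

All three calls return the same kind of Document object, so the following steps work unchanged whichever one you use.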

Get the root node in xml file

We should get the root node of the XML file first. From the root node, we can reach the child nodes easily.

root_node = DOMTree.documentElement

print(root_node.nodeName)

For an XML sitemap, this prints the name of the root node: urlset
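Checking the root node name can also tell a spider what kind of file it has. In the sitemaps.org protocol, a regular sitemap has the root urlset, while a sitemap index file, which lists other sitemap files, has the root sitemapindex. The helper below is a small sketch of that check; sitemap_kind and the sample documents are hypothetical names and data introduced for illustration.

```python
from xml.dom.minidom import parseString


def sitemap_kind(xml_text):
    """Return the root element name of a sitemap document."""
    root = parseString(xml_text).documentElement
    return root.nodeName


# Hypothetical sample documents with placeholder URLs.
urlset_doc = "<urlset><url><loc>https://www.example.com/</loc></url></urlset>"
index_doc = (
    "<sitemapindex><sitemap>"
    "<loc>https://www.example.com/post.xml</loc>"
    "</sitemap></sitemapindex>"
)

print(sitemap_kind(urlset_doc))
print(sitemap_kind(index_doc))
```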

Get all urls in xml sitemap

We can collect every loc node under the root node with getElementsByTagName(), then read the URL from each one. Here is an example.

loc_nodes = root_node.getElementsByTagName("loc")
for loc in loc_nodes:
    print(loc.childNodes[0].data)

Note: the URL is not stored on the loc node itself. The text inside a loc element is a separate child text node, so we read it with loc.childNodes[0].data.
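Reading childNodes[0] raises an IndexError when a loc element is empty, and the text can carry surrounding whitespace when the XML is pretty-printed. The function below is one possible, more defensive variant of the loop above. The name extract_urls and the sample document are assumptions for illustration, not part of the original steps.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def extract_urls(xml_text):
    """Return the stripped text of every non-empty <loc> element."""
    root = parseString(xml_text).documentElement
    urls = []
    for loc in root.getElementsByTagName("loc"):
        # Join all text and CDATA children instead of relying on childNodes[0].
        text = "".join(
            child.data
            for child in loc.childNodes
            if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
        ).strip()
        if text:  # skip empty <loc></loc> elements
            urls.append(text)
    return urls


# Hypothetical sample: one padded URL, one empty loc, one plain URL.
sample = """<urlset>
<url><loc>
  https://www.example.com/a/
</loc></url>
<url><loc></loc></url>
<url><loc>https://www.example.com/b/</loc></url>
</urlset>"""

urls = extract_urls(sample)
print(urls)
```

The function returns a list, so a spider can push the URLs into its crawl queue instead of only printing them.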