Python Extract PDF Bookmarks Using PyMuPDF: A Step-by-Step Guide for Beginners

April 23, 2020


When we display PDF books on a website, the bookmarks are among the most useful pieces of information for visitors. How can we extract the bookmarks of a PDF? In this tutorial, we will use the Python PyMuPDF library to get them.

How to get pdf bookmarks?

The bookmarks of a PDF are stored as a piece of metadata called the outline. Most Python libraries read this outline and return it as the bookmarks, which means that if a PDF has no outline, you will get an empty result (with PyMuPDF, an empty list).

How to extract pdf bookmarks using pymupdf library?

It is very easy to extract bookmarks using PyMuPDF.

Here is an example.

import fitz  # PyMuPDF

file = r'F:\PDF-Documents\Standard-Books\1\the-hitchhiker-s-guide-to-python-58884.pdf'
try:
    doc = fitz.open(file)
    # get_toc() was named getToC() in older PyMuPDF releases
    toc = doc.get_toc(simple=True)
    print(type(toc))
    print(toc)
    # Pass toc to your own helper here to format it (the helper is not shown in this tutorial)
except Exception as e:
    print(e)

Explanation of the example

1. First, we use fitz.open(file) to open the PDF file.

2. Then we call doc.getToC(simple = True) (named doc.get_toc() in newer PyMuPDF releases) to extract the bookmarks. The returned toc object holds the PDF bookmarks.

Run this code and you will get the bookmarks:

<class 'list'>
[[1, 'Copyright', 4], [1, 'Table of Contents', 7], [1, 'Preface', 13], [2, 'Conventions Used in This Book', 14]]

From the result, we can see:

1. The toc object is a Python list.

2. Each bookmark has the format:

[level, name, page]

level: the nesting level of the bookmark (1 is the top level)

name: the title of the bookmark

page: the page number in the PDF that the bookmark points to (counted from 1)
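Since every entry follows the same [level, name, page] format, the list can be turned into a readable outline with a few lines of plain Python. The sketch below is only an illustration: format_outline is a hypothetical helper name, not part of PyMuPDF, and the sample data is copied from the output shown above.

```python
def format_outline(toc, indent="  "):
    """Render a simple [level, name, page] list as indented text."""
    lines = []
    for level, name, page in toc:
        # Level 1 is the top level, so it gets no indentation.
        lines.append(f"{indent * (level - 1)}{name} (page {page})")
    return "\n".join(lines)


# Sample data copied from the output shown above.
sample_toc = [
    [1, "Copyright", 4],
    [1, "Table of Contents", 7],
    [1, "Preface", 13],
    [2, "Conventions Used in This Book", 14],
]
print(format_outline(sample_toc))
```

Indenting by level keeps the nesting visible, so a level-2 bookmark such as "Conventions Used in This Book" appears under "Preface".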


If the PDF file does not contain any outline metadata, you will get an empty Python list: [].

Once you have the PDF bookmarks, you can convert them to JSON to share them or save them in a database.
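As a rough sketch of the JSON step, the standard json module is enough. The function name toc_to_json and the dictionary keys level, name and page are assumptions chosen for this illustration; they simply mirror the [level, name, page] format described earlier.

```python
import json


def toc_to_json(toc):
    """Convert a simple [level, name, page] list to a JSON string."""
    entries = [
        {"level": level, "name": name, "page": page}
        for level, name, page in toc
    ]
    # ensure_ascii=False keeps non-ASCII bookmark titles readable.
    return json.dumps(entries, ensure_ascii=False, indent=2)


sample_toc = [[1, "Preface", 13], [2, "Conventions Used in This Book", 14]]
print(toc_to_json(sample_toc))
```

Naming the fields makes the JSON self-describing for whoever receives it. If a compact form is preferred, json.dumps(toc) on the raw list also works.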

Converting to JSON

Python Convert List to Json to Share Data: A Beginner Guide

Saving JSON to a database

Store JSON Data into MySQL Using Python: A Simple Guide
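The guide above covers MySQL. As a stand-in that needs no database server, the sketch below stores the bookmarks as a JSON string using sqlite3 from the Python standard library. This is a swapped-in technique, not the MySQL approach from that guide, and the table name pdf_bookmarks, its columns, and the helper names are all assumptions made for illustration.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the database and make sure the bookmarks table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_bookmarks ("
        "pdf_name TEXT PRIMARY KEY, bookmarks TEXT NOT NULL)"
    )
    return conn


def save_bookmarks(conn, pdf_name, toc):
    """Store one PDF's bookmarks as a JSON string keyed by file name."""
    # Parameter placeholders avoid building SQL from raw strings.
    conn.execute(
        "INSERT OR REPLACE INTO pdf_bookmarks (pdf_name, bookmarks) VALUES (?, ?)",
        (pdf_name, json.dumps(toc)),
    )
    conn.commit()


def load_bookmarks(conn, pdf_name):
    """Return the stored bookmark list, or [] if nothing was saved."""
    row = conn.execute(
        "SELECT bookmarks FROM pdf_bookmarks WHERE pdf_name = ?", (pdf_name,)
    ).fetchone()
    return json.loads(row[0]) if row else []


conn = open_store()  # in-memory database, for demonstration only
save_bookmarks(conn, "example.pdf", [[1, "Preface", 13]])
print(load_bookmarks(conn, "example.pdf"))
```

Pass a file path to open_store to keep the data between runs. The same idea carries over to MySQL by storing the JSON string in a text or JSON column.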
