When you display PDF books on a website, bookmarks are one of the most useful pieces of information for visitors. How can you extract the bookmarks of a PDF? In this tutorial, we will use the Python PyMuPDF library to do it.
How to get PDF bookmarks?
The bookmarks of a PDF are stored as a piece of meta information called the outline. Most Python libraries read the outline and return it as bookmarks, which means that if a PDF has no outline, you will get an empty result.
How to extract PDF bookmarks using the PyMuPDF library?
It is very easy to extract bookmarks using PyMuPDF, which is imported in Python as fitz.
Here is an example:
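import fitz  # PyMuPDF

file = r'F:\PDF-Documents\Standard-Books\1\the-hitchhiker-s-guide-to-python-58884.pdf'
bookmark = ''
try:
    # Open the PDF document
    doc = fitz.open(file)
    # Read the outline; newer PyMuPDF versions name this method get_toc
    toc = doc.getToC(simple=True)
    print(type(toc))
    print(toc)
    # parseBookmar is a user-defined helper that is not shown in this tutorial
    bookmark = parseBookmar(toc)
    print(bookmark)
except Exception as e:
    print(e)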
1. We use fitz.open(file) to open the PDF file first.
2. Then we use doc.getToC(simple = True) to extract the bookmarks. It returns the toc object, which holds the PDF bookmarks. In newer PyMuPDF versions this method is named doc.get_toc().
Run this code and you will get the bookmarks.
<class 'list'>
[[1, 'Copyright', 4], [1, 'Table of Contents', 7], [1, 'Preface', 13], [2, 'Conventions Used in This Book', 14]]
From the result, we can find:
1. The toc object is a Python list.
2. Each bookmark has the format:
[level, name, page]
level: the hierarchy level of the bookmark (1 is the top level)
name: the title of the bookmark
page: the page number the bookmark points to in the PDF (counted from 1)
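To make the three fields more concrete, here is a minimal sketch that prints the bookmark list as an indented outline. It uses the sample output shown above as its data. The function name format_outline is made up for this example and is not part of PyMuPDF. The first field (the layer, or level) decides how far each title is indented.

```python
# Sample data copied from the output shown earlier in this tutorial.
toc = [
    [1, 'Copyright', 4],
    [1, 'Table of Contents', 7],
    [1, 'Preface', 13],
    [2, 'Conventions Used in This Book', 14],
]


def format_outline(toc, indent='  '):
    """Return one text line per bookmark, indented by its level.

    format_outline is a hypothetical helper for this example only.
    """
    lines = []
    for level, title, page in toc:
        # Level 1 gets no indent, level 2 gets one indent unit, and so on.
        lines.append(f"{indent * (level - 1)}{title} (page {page})")
    return lines


for line in format_outline(toc):
    print(line)
```

Because the list is flat, the level number is the only thing that tells you which bookmark is a child of which.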
If the PDF file does not contain any outline information, you will get an empty Python list: []
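Since an empty list is a normal result, it is worth checking for it before using the bookmarks. The sketch below is one possible way to do that. The names load_bookmarks and summarize_toc are invented for this example. It assumes PyMuPDF is installed, and it tries get_toc (the name in newer releases) before falling back to getToC (the older name used in this tutorial).

```python
def load_bookmarks(path):
    """Return the bookmark list of a PDF, or [] if it has no outline.

    Hypothetical helper; assumes PyMuPDF is installed.
    """
    import fitz  # PyMuPDF; imported here so summarize_toc works without it

    doc = fitz.open(path)
    try:
        # Newer PyMuPDF releases use get_toc; older ones use getToC.
        get_toc = getattr(doc, 'get_toc', None) or getattr(doc, 'getToC')
        return get_toc(simple=True)
    finally:
        doc.close()


def summarize_toc(toc):
    """Describe a bookmark list, treating an empty list as 'no outline'."""
    if not toc:
        return 'No bookmarks found'
    deepest = max(entry[0] for entry in toc)
    return f'{len(toc)} bookmarks, deepest level {deepest}'


print(summarize_toc([]))
print(summarize_toc([[1, 'Preface', 13], [2, 'Conventions Used in This Book', 14]]))
```

You could call summarize_toc(load_bookmarks(file)) on your own PDF to see which case applies.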
After you have extracted the PDF bookmarks, you can convert them to JSON to share them or save them in a database.
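As a rough illustration of both ideas, the sketch below turns the bookmark list into a JSON string and stores it in an SQLite database using only the Python standard library. The table name pdf_bookmarks, the file name example.pdf and both function names are assumptions made for this example. A real site would probably use its own schema and database.

```python
import json
import sqlite3

# Sample data copied from the output shown earlier in this tutorial.
toc = [
    [1, 'Copyright', 4],
    [1, 'Table of Contents', 7],
    [1, 'Preface', 13],
    [2, 'Conventions Used in This Book', 14],
]


def toc_to_json(toc):
    """Convert [level, title, page] entries into a JSON string."""
    records = [
        {'level': level, 'title': title, 'page': page}
        for level, title, page in toc
    ]
    return json.dumps(records, ensure_ascii=False)


def save_bookmarks(conn, file_name, toc_json):
    """Store the JSON text in a hypothetical pdf_bookmarks table."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS pdf_bookmarks '
        '(file_name TEXT PRIMARY KEY, bookmarks TEXT)'
    )
    conn.execute(
        'INSERT OR REPLACE INTO pdf_bookmarks (file_name, bookmarks) '
        'VALUES (?, ?)',
        (file_name, toc_json),
    )
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(':memory:')
toc_json = toc_to_json(toc)
save_bookmarks(conn, 'example.pdf', toc_json)

# Read the row back and decode the JSON to check the round trip.
row = conn.execute(
    'SELECT bookmarks FROM pdf_bookmarks WHERE file_name = ?',
    ('example.pdf',),
).fetchone()
restored = json.loads(row[0])
print(restored[0])
```

Storing the whole outline as one JSON text column keeps the example short. If you need to search by title or page, one row per bookmark may be a better fit.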