Python can split a large PDF file into several smaller ones and merge several small PDF files into a larger one. In this tutorial, we will introduce how to split and merge PDF files using the Python PyMuPDF library.
You should install the PyMuPDF library first.
pip install pymupdf
Open a source pdf file
To split or merge PDF files, you should open a source PDF first. With PyMuPDF, we can do it like this:
import sys
import fitz  # PyMuPDF

file = '231420-digitalimageforensics.pdf'
try:
    doc = fitz.open(file)
except Exception as e:
    print(e)
    sys.exit(1)  # stop here: doc is undefined if opening failed

page_count = doc.pageCount
print(page_count)
Run this code and you will find that the source document (231420-digitalimageforensics.pdf) has 199 pages in total.
Then we can copy some pages from the source PDF into a new PDF.
To split or merge PDF files in PyMuPDF, we can use the Document.insertPDF() function. Newer PyMuPDF releases name this method Document.insert_pdf().
insertPDF(docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True, annots=True)
This function selects some pages from docsrc and inserts them into another PDF document.
The index of pages in a pdf document
In PyMuPDF, page indexes start at 0, which means a page index lies in [0, total_page - 1].
This is very important if you plan to select some pages from a source pdf file.
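Because readers usually think in page numbers that start at 1, a small conversion helper can prevent off-by-one mistakes. The following is a minimal sketch; the function name to_page_indexes is hypothetical and not part of PyMuPDF.

```python
def to_page_indexes(first_page, last_page, page_count):
    """Convert 1-based page numbers to a 0-based (from_page, to_page) pair.

    Hypothetical helper for illustration; it is not a PyMuPDF function.
    """
    if not 1 <= first_page <= last_page <= page_count:
        raise ValueError("need 1 <= first_page <= last_page <= page_count")
    # Page number n has index n - 1; both ends stay inclusive.
    return first_page - 1, last_page - 1


# Pages 4 to 6 of a 199-page document
print(to_page_indexes(4, 6, 199))  # (3, 5)
```

The returned pair can be passed as from_page and to_page when selecting pages.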
Important parameters explained
docsrc: the source PDF document, from which we select the pages in [from_page, to_page].
For example, [from_page = 3, to_page = 5] means we will select 3 pages (page 4, page 5 and page 6) from the source PDF.
from_page: int, the start index of the page range in docsrc.
to_page: int, the end index of the page range in docsrc. Note that this page is also selected.
start_at: int, this parameter determines where the pages from docsrc are inserted in the destination PDF. The default, -1, appends them at the end.
For example, start_at = 1 means the pages from docsrc are inserted between page index 0 and page index 1 of the destination PDF file.
Meanwhile, start_at should be smaller than the total number of pages in the destination PDF file.
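To see how from_page, to_page and start_at interact, the page order can be modelled with plain Python lists. This is only a sketch of the behaviour described above, not PyMuPDF code: the function simulate_insert and the page labels are made up for illustration, and it assumes from_page and to_page are given explicitly as valid indexes.

```python
def simulate_insert(dest_pages, src_pages, from_page, to_page, start_at=-1):
    """Model the resulting page order with lists (illustration only)."""
    # to_page is inclusive, so the slice must end at to_page + 1.
    selected = src_pages[from_page:to_page + 1]
    if start_at == -1:
        # -1 means: append after the last page of the destination.
        start_at = len(dest_pages)
    return dest_pages[:start_at] + selected + dest_pages[start_at:]


dest = ["A1", "A2", "A3"]              # destination with 3 pages
src = [f"S{n}" for n in range(1, 8)]   # source with pages S1..S7

print(simulate_insert(dest, src, from_page=3, to_page=5, start_at=1))
# ['A1', 'S4', 'S5', 'S6', 'A2', 'A3']
```

Source pages 4 to 6 end up between the first and second page of the destination, and the first copied page gets index 1.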
doc2 = fitz.open("new-doc-1.pdf")
# copy source pages with index 3 to 5 so that they start at index 1 of doc2
doc2.insertPDF(doc, from_page=3, to_page=5, start_at=1)
doc2.save("new-doc-4.pdf")
This code selects 3 pages from 231420-digitalimageforensics.pdf and inserts them after the first page of new-doc-1.pdf, creating a new PDF document, new-doc-4.pdf.
In this way, the same function can both split a PDF file and merge two PDF files into a new one.
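The same idea can be extended to split a whole document into fixed-size parts and to merge several files. The sketch below rests on some assumptions: the helper names chunk_ranges, split_pdf and merge_pdfs and the output file names are hypothetical, and because the method is called insertPDF in older PyMuPDF releases and insert_pdf in newer ones, the code looks up whichever name is available.

```python
def chunk_ranges(page_count, chunk_size):
    """Return inclusive, 0-based (from_page, to_page) pairs covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def _insert(dest, src, **kwargs):
    # Use insert_pdf if this PyMuPDF version has it, else the older insertPDF.
    method = getattr(dest, "insert_pdf", None) or dest.insertPDF
    method(src, **kwargs)


def split_pdf(src_path, chunk_size, prefix="part"):
    """Write every chunk of src_path to its own file: part-1.pdf, part-2.pdf, ..."""
    import fitz  # PyMuPDF; imported here so chunk_ranges works without it

    src = fitz.open(src_path)
    ranges = chunk_ranges(len(src), chunk_size)
    for n, (first, last) in enumerate(ranges, start=1):
        part = fitz.open()  # a new, empty PDF
        _insert(part, src, from_page=first, to_page=last)
        part.save(f"{prefix}-{n}.pdf")
        part.close()
    src.close()


def merge_pdfs(paths, out_path):
    """Append all pages of every file in paths, in order, and save the result."""
    import fitz  # PyMuPDF

    merged = fitz.open()
    for path in paths:
        src = fitz.open(path)
        _insert(merged, src)  # default arguments copy all pages to the end
        src.close()
    merged.save(out_path)
    merged.close()


# A 199-page document split into parts of at most 100 pages
print(chunk_ranges(199, 100))  # [(0, 99), (100, 198)]
```

Calling split_pdf('231420-digitalimageforensics.pdf', 100) would then write two files, and merge_pdfs(['part-1.pdf', 'part-2.pdf'], 'merged.pdf') would join them again.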