Best Practice to Python Extract Plain Text and HTML Text From PDF with PyMuPDF

Extracting text (plain text or HTML) from a PDF file is simple in Python with the PyMuPDF library, which supports many basic PDF operations. In this tutorial, we will show you how to extract text from PDF files with it.

Import library

import fitz  # PyMuPDF is imported under the name fitz

Prepare a pdf file

pdf = "F:\\test.pdf"

Open this pdf

doc = fitz.open(pdf)

Extract text page by page

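for page in doc:
    # get_text() replaces the camelCase getText() used by older PyMuPDF releases
    text = page.get_text("text")       # plain text
    html_text = page.get_text("html")  # HTML markup
    print(text)
    print(html_text)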
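
The loop above only prints the result. If you want to keep it, here is a minimal sketch that collects the text of each page and writes it to a file. The function names collect_page_text and save_pdf_text, and the choice of a form feed as the page separator, are our own assumptions and are not part of PyMuPDF.

```python
from pathlib import Path


def collect_page_text(pages, fmt="text"):
    """Return one extracted string per page, in the requested format."""
    return [page.get_text(fmt) for page in pages]


def save_pdf_text(pdf_path, out_path, fmt="text"):
    """Extract every page of pdf_path and write the result to out_path."""
    import fitz  # PyMuPDF; imported here so collect_page_text has no dependency

    # A PyMuPDF document can be used as a context manager and iterated page by page.
    with fitz.open(str(pdf_path)) as doc:
        chunks = collect_page_text(doc, fmt)
    # Assumption: a form feed between pages keeps page boundaries visible.
    Path(out_path).write_text("\f".join(chunks), encoding="utf-8")
    return len(chunks)
```

For example, save_pdf_text("F:\\test.pdf", "F:\\test.txt") should write the plain text of all pages to test.txt, and passing fmt="html" should write the HTML version instead.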

Notice:

1. To extract plain text, use the page.get_text("text") method (named getText in older PyMuPDF releases).

2. To extract HTML text, use the page.get_text("html") method.

PyMuPDF can also extract text in other formats, such as xhtml, xml and dict. See the PyMuPDF tutorial for more details:

https://pymupdf.readthedocs.io/en/latest/tutorial/#extracting-text-and-images
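
As an illustration of the dict format, the sketch below walks the structure returned by page.get_text("dict"). It assumes the layout described in the PyMuPDF documentation: a "blocks" list, where text blocks have type 0 and contain "lines", and each line contains "spans" with "text", "font" and "size" keys. The helper name iter_spans is our own.

```python
def iter_spans(page_dict):
    """Yield (text, font, size) for every text span in a get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block; image blocks are skipped
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], span["font"], span["size"]
```

Inside the page loop you could then write: for text, font, size in iter_spans(page.get_text("dict")). This is useful when you need font information that the plain text output does not carry.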

2 thoughts on “Best Practice to Python Extract Plain Text and HTML Text From PDF with PyMuPDF – Python PDF Operation”

Jack June 4, 2020

Thank you so much for the article. I have a problem: I have a number of PDF files, and I want to extract the text from the first page of each one and save it to either a text file or a CSV file.

Thank you


admin (Post author) June 4, 2020

To extract the text of the first page, you can use PyMuPDF. However, some PDF files cannot be extracted this way because they were created by a scanner and contain only images. In that situation, you can convert the first page of the PDF to an image and then extract the text from that image using Python. More details here: https://www.tutorialexample.com/python-pdf-document-processing-notes-for-beginners/
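
A minimal sketch of the PyMuPDF part, assuming all the PDF files sit in one folder and have a text layer. The function names first_page_text and write_first_pages and the CSV column names are hypothetical choices for illustration.

```python
import csv
from pathlib import Path


def first_page_text(pdf_path):
    """Return the plain text of the first page, or "" if there are no pages."""
    import fitz  # PyMuPDF; imported here so the CSV helper has no dependency

    with fitz.open(str(pdf_path)) as doc:
        return doc[0].get_text("text") if len(doc) else ""


def write_first_pages(folder, csv_path, extract=first_page_text):
    """Write one CSV row (file name, first-page text) per PDF in folder."""
    rows = [(p.name, extract(p)) for p in sorted(Path(folder).glob("*.pdf"))]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "text"])  # assumed column names
        writer.writerows(rows)
    return len(rows)
```

Calling write_first_pages("F:\\pdfs", "F:\\first_pages.csv") would then produce one row per file. For a scanned PDF the text column will most likely be empty, which is a sign that the image-based approach is needed for that file.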
