Best Practice to Extract Plain Text and HTML Text From PDF with PyMuPDF in Python – Python PDF Operation

By | August 7, 2019

Extracting text (plain text or HTML) from a PDF file is simple in Python. We can use the PyMuPDF library, which provides many basic PDF operations. In this tutorial, we will show you how to extract text from PDF files with it.


Import library

import fitz  # PyMuPDF is imported under the name fitz

Prepare a pdf file

pdf = "F:\\test.pdf"

Open this pdf

doc = fitz.open(pdf)

Extract text page by page

for page in doc:
    # get_text() was named getText() in older PyMuPDF releases
    text = page.get_text("text")       # plain text
    html_text = page.get_text("html")  # HTML markup
    print(text)
    print(html_text)

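Notice:

1. To extract plain text, we should use the page.get_text("text") method (page.getText("text") in older PyMuPDF releases).

2. To extract HTML text, we should use the page.get_text("html") method (page.getText("html") in older PyMuPDF releases).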
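The page-by-page loop can also be wrapped in a small reusable helper. The following is only a minimal sketch and not part of the original tutorial: the function names extract_pages and pdf_to_text are hypothetical, and it assumes a PyMuPDF release in which the method is named get_text (older releases call it getText).

```python
def extract_pages(doc, fmt="text"):
    """Return one extracted string per page, in page order."""
    # `doc` is any iterable of pages that offer get_text(fmt),
    # for example the object returned by fitz.open().
    return [page.get_text(fmt) for page in doc]


def pdf_to_text(path, fmt="text"):
    """Open the PDF at `path` and join the output of all its pages."""
    import fitz  # PyMuPDF; imported here so extract_pages works without it

    doc = fitz.open(path)
    try:
        return "\n".join(extract_pages(doc, fmt))
    finally:
        doc.close()  # release the file handle even if extraction fails
```

With this sketch, pdf_to_text("F:\\test.pdf") would return the plain text of the whole file as one string, and pdf_to_text("F:\\test.pdf", "html") would return the HTML text instead.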

PyMuPDF can also extract text in other formats, such as xhtml, xml and dict. See the documentation for more details:

https://pymupdf.readthedocs.io/en/latest/tutorial/#extracting-text-and-images
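As an illustration of the dict format, here is a hedged sketch that walks the nested result and lists each piece of text with its font and size. The helper name iter_spans is hypothetical. It assumes the layout described in the PyMuPDF documentation, where the dictionary holds a "blocks" list, each text block holds "lines", and each line holds "spans" with "text", "font" and "size" keys.

```python
def iter_spans(page_dict):
    """Yield (text, font, size) for every text span in a "dict" result."""
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], span["font"], span["size"]
```

Inside the page loop, it could be used like this: for text, font, size in iter_spans(page.get_text("dict")): print(text, font, size).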