Best Practice to Python Extract Plain Text and HTML Text From PDF with PyMuPDF – Python PDF Operation

By | August 7, 2019

Extracting text (plain text or HTML) from a PDF file is simple in Python with the PyMuPDF library, which provides many basic PDF operations. In this tutorial, we will show you how to use it to extract text from PDF files.

Import library

import fitz  # PyMuPDF is imported under the name fitz

Prepare a pdf file

pdf = "F:\\test.pdf"

Open this pdf

doc = fitz.open(pdf)

Extract text page by page

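for page in doc:
    # Recent PyMuPDF releases name this method get_text(); older ones used getText()
    text = page.get_text("text")       # plain text of the page
    html_text = page.get_text("html")  # the same page as HTML markup
    print(text)
    print(html_text)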
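
The loop above can also be wrapped in a small helper that returns the text of every page instead of printing it. This is only a sketch: extract_pages and extract_pages_from_file are illustrative names, not part of PyMuPDF, and the code assumes a PyMuPDF release in which the page method is called get_text.

```python
def extract_pages(doc, fmt="text"):
    """Return a list holding the extracted text of each page in doc.

    doc is an open PyMuPDF document (or any iterable of objects that
    offer get_text); fmt is passed straight to page.get_text().
    """
    return [page.get_text(fmt) for page in doc]


def extract_pages_from_file(path, fmt="text"):
    """Open the PDF at path and return the text of all its pages."""
    import fitz  # PyMuPDF, imported here so extract_pages() has no dependency

    # The document is closed automatically when the with block ends
    with fitz.open(path) as doc:
        return extract_pages(doc, fmt)
```

With these helpers, extract_pages_from_file("F:\\test.pdf", "html") would return one HTML string per page of the file used in this tutorial.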

Notice:

1. To extract plain text, call page.get_text("text") (named getText in older PyMuPDF releases).

2. To extract HTML text, call page.get_text("html").

PyMuPDF can also extract text in other formats, such as xhtml, xml and dict. See the PyMuPDF tutorial for more details:

https://pymupdf.readthedocs.io/en/latest/tutorial/#extracting-text-and-images
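
As an illustration of one of these other formats, the sketch below walks the dictionary returned by get_text("dict") and collects the text line by line. It assumes the blocks, lines and spans layout described in the PyMuPDF documentation; lines_from_dict and print_page_lines are made-up helper names, and the page method is called getText in older releases.

```python
def lines_from_dict(page_dict):
    """Collect the text of every line in a get_text("dict") result."""
    lines = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here
        for line in block.get("lines", []):
            # A line is split into spans (runs of text sharing one font)
            lines.append("".join(span["text"] for span in line["spans"]))
    return lines


def print_page_lines(path):
    """Print the text lines of each page of the PDF at path."""
    import fitz  # PyMuPDF, imported here so lines_from_dict() has no dependency

    with fitz.open(path) as doc:
        for page in doc:
            for text in lines_from_dict(page.get_text("dict")):
                print(text)
```

The dict format is mainly useful when you need more than the raw text, because each span also carries details such as its font and size.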

2 thoughts on “Best Practice to Python Extract Plain Text and HTML Text From PDF with PyMuPDF – Python PDF Operation”

  1. Jack

    Thank you so much for the article. I have a problem: I have a number of PDF files, and I want to extract the text from the first page of each one and save it to either a text file or a CSV file.

    Thank you
