Extracting text (plain text or HTML) from a PDF file is simple in Python with the PyMuPDF library, which supports many basic PDF operations. This tutorial shows how to use it to extract text from PDF files.
Import library
import fitz  # PyMuPDF is imported under the name fitz
Prepare a PDF file
pdf = "F:\\test.pdf"
Open the PDF
doc = fitz.open(pdf)
Extract text page by page
for page in doc:
    # get_text() is the current method name; older PyMuPDF releases spell it getText()
    text = page.get_text("text")
    html_text = page.get_text("html")
    print(text)
    print(html_text)
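The steps above can also be wrapped into small reusable functions. The following is only a minimal sketch: the names extract_pages and extract_from_file are hypothetical helpers, not part of PyMuPDF, and it assumes a PyMuPDF release where the page method is called get_text and the document can be used as a context manager.

```python
def extract_pages(doc, mode="text"):
    """Return one extracted string per page of an opened document.

    `doc` is anything that yields page objects with a get_text(mode)
    method, such as the object returned by fitz.open().
    """
    return [page.get_text(mode) for page in doc]


def extract_from_file(path, mode="text"):
    """Open the PDF at `path` and return its text page by page."""
    # PyMuPDF is imported here so that extract_pages() itself
    # has no hard dependency on the library.
    import fitz

    # The document is closed automatically when the with block exits.
    with fitz.open(path) as doc:
        return extract_pages(doc, mode)
```

With these helpers, extract_from_file("F:\\test.pdf") would return a list of plain-text strings, one per page, and passing mode="html" would return the HTML version instead.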
Notice:
1. To extract plain text, call page.get_text("text") (spelled page.getText("text") in older PyMuPDF releases).
2. To extract HTML text, call page.get_text("html").
PyMuPDF can also extract text in other formats, such as xhtml, xml, and dict. See the documentation for more details:
https://pymupdf.readthedocs.io/en/latest/tutorial/#extracting-text-and-images
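As an illustration of the dict format, here is a hedged sketch that walks the nested structure and yields each piece of text with its font size. The helper name iter_spans is hypothetical. The sketch assumes the dictionary holds a "blocks" list, that text blocks contain "lines", and that each line contains "spans" with "text" and "size" keys.

```python
def iter_spans(page_dict):
    """Yield (text, font size) for every text span in a page dictionary."""
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], span["size"]


# Example usage (requires PyMuPDF and an existing file):
# import fitz
# with fitz.open("F:\\test.pdf") as doc:
#     for page in doc:
#         for text, size in iter_spans(page.get_text("dict")):
#             print(size, text)
```

Working from the dictionary rather than the plain string keeps layout details such as font size available, which can help when telling headings apart from body text.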