Extracting text (plain text or HTML) from a PDF file is simple in Python with the PyMuPDF library, which supports many basic PDF operations. In this tutorial, we will show you how to extract text from PDF files with it.
Import library
import fitz  # PyMuPDF is imported under the name fitz
Prepare a pdf file
pdf = "F:\\test.pdf"
Open this pdf
doc = fitz.open(pdf)
Extract text page by page
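for page in doc:
    # Recent PyMuPDF releases name this method get_text()
    text = page.getText("text")       # plain text of the page
    html_text = page.getText("html")  # the same page as HTML markup
    print(text)
    print(html_text)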
Notice:
1. To extract plain text, use the page.getText("text") method.
2. To extract HTML text, use the page.getText("html") method.
PyMuPDF can also extract text in other formats, such as xhtml, xml and dict. See the documentation for more details:
https://pymupdf.readthedocs.io/en/latest/tutorial/#extracting-text-and-images
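As an illustration of the dict format, here is a minimal sketch that walks the nested structure PyMuPDF documents for this output (blocks, then lines, then spans) and collects the text, font and size of each span. The helper names iter_spans and print_first_page_spans are our own and not part of PyMuPDF. The sketch uses get_text(), the name of getText() in recent PyMuPDF releases.

```python
def iter_spans(page_dict):
    """Yield (text, font, size) for every text span in a page dictionary.

    page_dict is assumed to follow the layout PyMuPDF documents for the
    "dict" output: blocks -> lines -> spans.
    """
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], span["font"], span["size"]


def print_first_page_spans(pdf_path):
    """Print the spans found on the first page (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so iter_spans works without it

    with fitz.open(pdf_path) as doc:
        page_dict = doc[0].get_text("dict")
    for text, font, size in iter_spans(page_dict):
        print(f"{size:5.1f}  {font:<20}  {text}")
```

Calling print_first_page_spans("F:\\test.pdf") would list every span of the first page with its font size and font name, which can help to tell headings from body text.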
Thank you so much for the article. I have a problem: I have a number of PDF files, and I want to extract the text from the first page of each file and save it to either a text file or a CSV file.
Thank you
To extract the text of the first page, you can use PyMuPDF. However, some PDF files yield no text because they were created by a scanner. In that case, you can convert the first page of the PDF to an image and then extract the text from that image using Python. More details here: https://www.tutorialexample.com/python-pdf-document-processing-notes-for-beginners/
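A minimal sketch of the first-page extraction, assuming all the PDF files sit in one folder: it reads the first page of each file and writes one CSV row per file. The function names, the column names and the folder layout are our own assumptions. It uses get_text(), the name of getText() in recent PyMuPDF releases. Files whose first page has no text (probably scanned) are only reported, not processed further.

```python
import csv
from pathlib import Path


def first_page_text(pdf_path):
    """Return the plain text of the first page of pdf_path."""
    import fitz  # PyMuPDF; imported lazily so write_csv needs only the stdlib

    with fitz.open(str(pdf_path)) as doc:
        if len(doc) == 0:
            return ""
        return doc[0].get_text("text")


def write_csv(rows, out_path):
    """Write (file name, text) pairs to out_path as a two-column CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "first_page_text"])
        writer.writerows(rows)


def export_first_pages(folder, out_path):
    """Extract page one of every *.pdf in folder and save the results."""
    rows = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        text = first_page_text(pdf_path)
        if not text.strip():
            # No text layer: probably a scanned PDF that needs another approach.
            print(f"no text found in {pdf_path.name}")
        rows.append((pdf_path.name, text))
    write_csv(rows, out_path)
    return len(rows)
```

For example, export_first_pages("F:\\pdfs", "F:\\first_pages.csv") would process every PDF in the (hypothetical) folder F:\pdfs and return the number of files it handled.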