Use PyPDF2 to extract PDF annotations

Background:

I use Foxit PDF Reader to read PDF books and papers, and to add marks and comments to the PDF files. But the reader does not seem to provide an export function; maybe I just haven't found where it is :)

So I tried using PyPDF2 to export the comments I have written in a PDF.

The Code:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# Written against the PyPDF2 1.x/2.x API. PyPDF2 3.0 removed these names
# (PdfFileReader, getPage, extractText, getObject) in favour of
# PdfReader, reader.pages[i], extract_text() and get_object().
from PyPDF2 import PdfFileReader


def text_extractor(path, output_path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # getPage() is zero-based, so index 4 is the fifth page
        page = pdf.getPage(4)

        # write the page's raw text to the output file
        with open(output_path, 'w') as text_file:
            text_file.write(page.extractText())

        # print the comment/annotation text to the console
        annotations = page.get("/Annots")
        if annotations is None:
            # the page has no annotations
            return
        for annot in annotations.getObject():
            contents = annot.getObject().get("/Contents")
            if contents is not None:
                print(contents.getObject())


if __name__ == '__main__':
    path = './youPDF.pdf'
    out_path = './test.txt'
    text_extractor(path, out_path)

Finally

Fill in your PDF filename and path, then run the script. The raw text is saved to test.txt and your comments are printed to the console. Note that the code above only processes one page (index 4, i.e. the fifth page). To export all of your comments, add a for loop that iterates over all pages and extracts the text and annotations from each.
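
Here is a rough sketch of what that loop could look like. The function names (resolve, collect_comments, comments_from_pdf) are my own and not part of PyPDF2. I also assume the same pre-3.0 PyPDF2 API as in the script above, so treat it as a starting point rather than a finished tool. The collecting part only needs objects with a dict-like .get(), which makes it easy to try without a real PDF.

```python
def resolve(obj):
    """Follow an indirect reference if the object has one (PyPDF2 < 3.0 spelling)."""
    return obj.getObject() if hasattr(obj, "getObject") else obj


def collect_comments(pages):
    """Return (page_number, comment) pairs; page numbers start at 1."""
    found = []
    for number, page in enumerate(pages, start=1):
        annots = page.get("/Annots")
        if annots is None:
            # this page has no annotations at all
            continue
        for annot in resolve(annots):
            contents = resolve(annot).get("/Contents")
            if contents is not None:
                found.append((number, resolve(contents)))
    return found


def comments_from_pdf(path):
    """Hypothetical wrapper: read every page of a PDF and collect its comments."""
    # imported here so the helpers above can be used without PyPDF2 installed
    from PyPDF2 import PdfFileReader

    with open(path, "rb") as f:
        pdf = PdfFileReader(f)
        pages = [pdf.getPage(i) for i in range(pdf.getNumPages())]
        # resolve everything while the file is still open
        return collect_comments(pages)


if __name__ == "__main__":
    # plain dicts standing in for PDF pages, just to show the shape of the result
    fake_pages = [
        {"/Annots": [{"/Contents": "first note"}, {"/Subtype": "/Link"}]},
        {},
        {"/Annots": [{"/Contents": "second note"}]},
    ]
    for page_number, comment in collect_comments(fake_pages):
        print(page_number, comment)
```

With a real file you would call comments_from_pdf('./youPDF.pdf') instead and print or save the pairs it returns.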
