Burdoc: Advanced PDF Parsing For Python

About the Project

Burdoc is a Python library and command-line script that automates the extraction of complex, text-driven content from PDFs. It generates a rich semantic representation of a PDF, including headings, reading order, tables, and images, which can then be used for downstream processing.

Why Another PDF Parsing Library?

Excellent question! Between pdfminer, PyMuPDF, Tika, and many others, there is a plethora of tools for parsing PDFs, but nearly all focus on the initial step of pulling out raw content, not on representing the document's actual meaning.

Key Features

  • Rich Document Representation: Burdoc is able to identify most common types of text, including:

    • Paragraphs

    • Headings

    • Lists (ordered and unordered)

    • Headers, footers, and sidebars

    • Visual Asides such as read-out boxes

  • Structured Output: Burdoc generates a comprehensive JSON representation of the text. Unlike many other tools, it preserves information such as metadata, fonts, and original bounding boxes, so downstream users have as much information as they need.

  • Complex Reading Order Inference: Burdoc uses a multi-stage algorithm to infer reading order even in complex pages with changing numbers of columns, split sections, and asides.

  • ML-Powered Table Extraction: Burdoc makes use of the latest machine learning models for identifying tables, alongside a rules-based approach to identify inline tables.

  • Large Documents: Because Burdoc relies on PyMuPDF rather than pdfminer, its core PDF reading step is substantially faster than in other libraries, and it can handle large files (thousands of pages or hundreds of MB in size) with ease. Running a single page through Burdoc can be quite slow, around 2 s, because of expensive initialisation requirements, but with GPU acceleration and multithreading support Burdoc can process documents at 0.2-0.5 s/page.
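
To make the reading-order problem described above more concrete, here is a minimal sketch of one way to order text blocks on a page that mixes full-width and multi-column text. It is not Burdoc's multi-stage algorithm, only an illustration of the kind of inference involved. The `Block` class, the `reading_order` function, and the `full_width_ratio` threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with a bounding box (origin top-left, y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks, page_width, full_width_ratio=0.7):
    """Order blocks on a page that mixes full-width and column text.

    Illustrative heuristic only (an assumption, not Burdoc's method):
    1. A block at least full_width_ratio * page_width wide is a divider.
    2. Blocks between two dividers form a band.
    3. Inside a band, blocks whose horizontal extents overlap share a
       column; columns are read left to right, each one top to bottom.
    """
    ordered = []
    band = []

    def flush_band():
        columns = []  # each column is a list of blocks
        for block in sorted(band, key=lambda b: b.x0):
            for column in columns:
                left = min(b.x0 for b in column)
                right = max(b.x1 for b in column)
                if block.x0 < right and block.x1 > left:
                    column.append(block)
                    break
            else:
                # No overlapping column found, so start a new one.
                columns.append([block])
        for column in columns:
            ordered.extend(sorted(column, key=lambda b: b.y0))
        band.clear()

    for block in sorted(blocks, key=lambda b: b.y0):
        if block.x1 - block.x0 >= full_width_ratio * page_width:
            flush_band()  # emit the columns above this divider first
            ordered.append(block)
        else:
            band.append(block)
    flush_band()
    return ordered


if __name__ == "__main__":
    page = [
        Block("right 1", 310, 60, 550, 180),
        Block("title", 50, 20, 550, 50),
        Block("left 2", 50, 200, 290, 380),
        Block("closing", 50, 400, 550, 450),
        Block("left 1", 50, 60, 290, 180),
        Block("right 2", 310, 210, 550, 380),
    ]
    print([b.text for b in reading_order(page, page_width=600)])
    # ['title', 'left 1', 'left 2', 'right 1', 'right 2', 'closing']
```

A naive top-to-bottom sort would interleave the two columns ("left 1", "right 1", "left 2", ...). Splitting the page into bands at full-width blocks is what lets the number of columns change partway down a page. Asides and split sections would need further handling that this sketch does not attempt.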

Limitations

  • OCR: Because Burdoc relies on high-precision font and location information for its processing, it is likely to perform badly when parsing OCR’d files.

  • Right-to-Left Text: All parsing is for left-to-right languages only.

  • Complex Figures: Areas with large amounts of text arranged around figures in an arbitrary fashion will not be extracted correctly.

  • Forms: Currently Burdoc has no way to recognise complex forms.