BurdocParser

BurdocParser provides the primary interface for extracting content from PDFs. It builds a processing chain based on the user configuration and uses it to extract the content.

class burdoc.burdoc_parser.BurdocParser(detailed: bool = False, skip_ml_table_finding: bool = False, ignore_images: bool = False, max_threads: int | None = None, log_level: int = 20, show_pages: bool = False)

Top-level class to extract structured content from a PDF.

Example Usage:

```python
from burdoc import BurdocParser

parser = BurdocParser()
content = parser.read("file.pdf")
```

__init__(detailed: bool = False, skip_ml_table_finding: bool = False, ignore_images: bool = False, max_threads: int | None = None, log_level: int = 20, show_pages: bool = False)

Instantiate a BurdocParser.

Parameters:
  • detailed (bool) – Include detailed information such as font statistics and bounding boxes in the output. Defaults to False.

  • skip_ml_table_finding (bool) – Skip the ML table finding algorithms. Defaults to False.

  • ignore_images (bool) – Don't extract any images from the document. Much faster, but prone to errors if images are used as layout elements. Defaults to False.

  • max_threads (Optional[int], optional) – Maximum number of threads to run. Set to None to use default system limits or 1 to force single-threaded mode. Defaults to None.

  • log_level (int, optional) – Logging level. Defaults to logging.INFO (20).

  • show_pages (bool, optional) – Draw each page as it’s extracted with extraction information laid on top. Primarily for debugging. Defaults to False.

Raises:

ImportError – transformer library detected but loading transformer library failed.
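
As a minimal sketch of how the constructor parameters above might be combined, the helper below assembles keyword arguments for a "debug" and a "normal" configuration. The names `parser_options` and `make_parser` are hypothetical and not part of burdoc, and the choice of which flags to flip when debugging is an assumption made for illustration.

```python
import logging
from typing import Any, Dict


def parser_options(debug: bool = False) -> Dict[str, Any]:
    """Return keyword arguments for BurdocParser using only the documented parameters."""
    return {
        "detailed": debug,                    # font statistics and bounding boxes
        "skip_ml_table_finding": False,       # keep ML table finding enabled
        "ignore_images": False,               # still extract images
        "max_threads": 1 if debug else None,  # single-threaded while debugging
        "log_level": logging.DEBUG if debug else logging.INFO,
        "show_pages": False,                  # no per-page debug drawing
    }


def make_parser(debug: bool = False):
    """Build a parser; burdoc is imported lazily so this module loads without it."""
    from burdoc import BurdocParser  # requires burdoc to be installed

    return BurdocParser(**parser_options(debug))
```

Calling `make_parser(debug=True)` would then construct a single-threaded parser with detailed output, provided burdoc and its dependencies are installed.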

print_profile_info()

Print the performance profile for the last run.

read(path: str, pages: List[int] | None = None, extract_images: bool = True, extract_page_images: bool = False, extract_page_hierarchy: bool = True) → Any

Read a PDF and return a structured response.

Parameters:
  • path (str) – Path of the PDF to load.

  • pages (Optional[List[int]], optional) – List of pages to extract. Defaults to None.

  • extract_images (bool, optional) – Extract images from the PDF. This can cause the output to become extremely large. Defaults to True.

  • extract_page_images (bool, optional) – Extract the page images rendered as part of the processing. Defaults to False.

  • extract_page_hierarchy (bool, optional) – Extract a list of headings and titles. Defaults to True.

Raises:
  • FileNotFoundError – If the file cannot be found.

  • EmptyFileError – If the file has zero length. Subclass of FileDataError and RuntimeError

  • ValueError – If unknown file type is specified. Subclass of RuntimeError

  • FileDataError – If the document has an invalid structure for the given type. Subclass of RuntimeError
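
The exceptions listed above could be handled along the following lines. This is a sketch only: `safe_read` is a hypothetical wrapper rather than part of burdoc, and it accepts any object with a compatible `read` method. It relies on the statement above that EmptyFileError and FileDataError are RuntimeError subclasses, so they are caught through RuntimeError without being imported by name.

```python
from typing import Any, Dict, List, Optional


def safe_read(
    parser: Any, path: str, pages: Optional[List[int]] = None
) -> Optional[Dict[str, Any]]:
    """Call parser.read and return None instead of raising on the documented errors."""
    try:
        return parser.read(path, pages=pages)
    except FileNotFoundError:
        print(f"File not found: {path}")
    except ValueError:
        print(f"Unknown file type: {path}")
    except RuntimeError as err:
        # Covers EmptyFileError and FileDataError, documented as RuntimeError subclasses.
        print(f"Could not parse {path}: {err}")
    return None
```

Returning None keeps batch jobs running over many files; callers that need to distinguish the failure modes may prefer to let the exceptions propagate.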

Returns:

Structured content, with the format {

  'metadata' (Dict[str, Any]): Any metadata about the file itself
  'content' (Dict[int, List[Any]]): Ordered content organised per page
  'page_hierarchy' (Dict[int, List[Any]]): Headers found in each page
  'images' (Dict[int, List[PIL.Image.Image]], optional): Images extracted from each page. Only generated if extract_images is True.
  'page_images' (Dict[int, PIL.Image.Image], optional): Image rendered for each page. Only generated if extract_page_images is True.

}

Return type:

Dict[str, Any]
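
To illustrate how the returned dictionary might be consumed, the sketch below builds one summary line per page from the documented keys. `summarise` is a hypothetical helper, not part of burdoc, and it treats the items inside each per-page list as opaque because their structure is not described here.

```python
from typing import Any, Dict, List


def summarise(result: Dict[str, Any]) -> List[str]:
    """Build one summary line per page from a read()-style result dictionary."""
    content = result.get("content", {})
    hierarchy = result.get("page_hierarchy", {})
    images = result.get("images", {})  # optional key, absent unless images were extracted
    lines = []
    for page in sorted(content):
        lines.append(
            f"page {page}: {len(content[page])} items, "
            f"{len(hierarchy.get(page, []))} headings, "
            f"{len(images.get(page, []))} images"
        )
    return lines
```

With a parser instance available, `print("\n".join(summarise(parser.read("file.pdf"))))` would print the per-page counts.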