BurdocParser
BurdocParser provides the primary interface for extracting content from PDFs. It builds a processing chain based on user configuration and uses it to extract content.
- class burdoc.burdoc_parser.BurdocParser(detailed: bool = False, skip_ml_table_finding: bool = False, ignore_images: bool = False, max_threads: int | None = None, log_level: int = 20, show_pages: bool = False)
Top-level class to extract structured content from a PDF.
Example Usage:
```python
from burdoc import BurdocParser
content = BurdocParser().read("file.pdf")
```
- __init__(detailed: bool = False, skip_ml_table_finding: bool = False, ignore_images: bool = False, max_threads: int | None = None, log_level: int = 20, show_pages: bool = False)
Instantiate a BurdocParser.
- Parameters:
detailed (bool) – Include detailed information such as font statistics and bounding boxes in the output
skip_ml_table_finding (bool) – Skip the ML table finding algorithms. Defaults to False.
ignore_images (bool) – Don’t extract any images from the document. Much faster, but prone to errors if images are used as layout elements.
max_threads (Optional[int], optional) – Maximum number of threads to run. Set to None to use default system limits or 1 to force single-threaded mode. Defaults to None.
log_level (int, optional) – Logging level. Defaults to logging.INFO.
show_pages (bool, optional) – Draw each page as it’s extracted with extraction information laid on top. Primarily for debugging. Defaults to False.
- Raises:
ImportError – If the transformer library is detected but fails to load.
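As a minimal sketch of how the constructor options above might be combined, the helper below builds keyword arguments for either a debugging-oriented or a faster configuration. The names `parser_options` and `make_parser` are hypothetical and not part of burdoc. Only the parameter names are taken from the signature above, and the import assumes the `burdoc` package is installed.

```python
import logging
from typing import Any, Dict


def parser_options(debug: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for BurdocParser (hypothetical helper)."""
    if debug:
        # Single-threaded and verbose, with extra detail in the output.
        return {
            "detailed": True,
            "max_threads": 1,
            "log_level": logging.DEBUG,
        }
    # Faster configuration: skip ML table finding and image extraction.
    return {
        "skip_ml_table_finding": True,
        "ignore_images": True,
        "max_threads": None,  # use default system limits
        "log_level": logging.WARNING,
    }


def make_parser(debug: bool = False):
    """Create a parser from the options above (assumes burdoc is installed)."""
    # Imported lazily so parser_options() works without burdoc present.
    from burdoc import BurdocParser

    return BurdocParser(**parser_options(debug))
```

Keeping the options in a plain dictionary makes the chosen configuration easy to log or test separately from the parser itself.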
- print_profile_info()
Print performance profile for the last run.
- read(path: str, pages: List[int] | None = None, extract_images: bool = True, extract_page_images: bool = False, extract_page_hierarchy: bool = True) → Any
Read a PDF and output a structured response
- Parameters:
path (str) – Path of the pdf to load
pages (Optional[List[int]], optional) – List of pages to extract. Defaults to None.
extract_images (bool, optional) – Extract images from the PDF. This can make the output extremely large. Defaults to True.
extract_page_images (bool, optional) – Extract the page images rendered as part of the processing. Defaults to False.
extract_page_hierarchy (bool, optional) – Extract a list of headings and titles. Defaults to True.
- Raises:
FileNotFoundError – If the file cannot be found.
EmptyFileError – If the file has zero length. Subclass of FileDataError and RuntimeError
ValueError – If unknown file type is specified. Subclass of RuntimeError
FileDataError – If the document has an invalid structure for the given type. Subclass of RuntimeError
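One way to handle the errors listed above is sketched below. The wrapper `safe_read` is a hypothetical helper, not part of burdoc. It accepts any object with a `read(path)` method and relies only on the stated fact that EmptyFileError and FileDataError are subclasses of RuntimeError, so it does not need to import them.

```python
from typing import Any, Optional


def safe_read(parser: Any, path: str) -> Optional[Any]:
    """Call parser.read(path), returning None on the documented errors."""
    try:
        return parser.read(path)
    except FileNotFoundError:
        print(f"No such file: {path}")
    except ValueError:
        print(f"Unknown file type: {path}")
    except RuntimeError as err:
        # Per the list above, EmptyFileError and FileDataError are
        # RuntimeError subclasses, so this branch covers both.
        print(f"Could not parse {path}: {err}")
    return None
```

A caller would pass an existing parser instance, for example `safe_read(BurdocParser(), "file.pdf")`, and check the result for `None` before using it.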
- Returns:
Structured content, in the format:
{
'metadata' (Dict[str, Any]): Any metadata about the file itself.
'content' (Dict[int, List[Any]]): Ordered content organised per page.
'page_hierarchy' (Dict[int, List[Any]]): Headers found in each page.
'images' (Dict[int, List[PIL.Image.Image]], optional): Images extracted from each page. Only generated if extract_images is True.
'page_images' (Dict[int, PIL.Image.Image], optional): Image rendered for each page. Only generated if extract_page_images is True.
}
- Return type:
Dict[str, Any]
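To illustrate how the returned dictionary might be consumed, the sketch below counts content items and headings per page. The function `summarise` is a hypothetical helper, and the `example` dictionary is a hand-written stand-in for a real `read()` result. Its page keys and string items are arbitrary placeholders; only the top-level key names come from the description above.

```python
from typing import Any, Dict, List


def summarise(content: Dict[str, Any]) -> Dict[int, Dict[str, int]]:
    """Count content items and headings per page (hypothetical helper)."""
    pages: Dict[int, List[Any]] = content.get("content", {})
    hierarchy: Dict[int, List[Any]] = content.get("page_hierarchy", {})
    summary: Dict[int, Dict[str, int]] = {}
    for page_number in sorted(pages):
        summary[page_number] = {
            "items": len(pages[page_number]),
            # Pages without recorded headings count as zero.
            "headings": len(hierarchy.get(page_number, [])),
        }
    return summary


# Hand-written stand-in for a read() result; real items are richer objects.
example = {
    "metadata": {"title": "demo"},
    "content": {0: ["para", "table"], 1: ["para"]},
    "page_hierarchy": {0: ["Heading"]},
}
print(summarise(example))
```

Using `.get` with defaults keeps the helper usable when optional keys are absent from the result.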