PDFLoadProcessor
Individual loaders for the elements used by the PDF processor
PDFLoadProcessor
- class burdoc.processors.pdf_load_processor.pdf_load_processor.PDFLoadProcessor(log_level: int = 20, ignore_images: bool = False)
Loads a PDF from file and extracts essential information, applying minor processing and cleaning.
Requires: None
Generates: ['page_bounds', 'text_elements', 'image_elements', 'drawing_elements', 'images', 'page_images']
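The Requires/Generates line reads as a contract: a processor declares the data fields it needs and the fields it adds. The sketch below shows one way a pipeline could check such a contract before running. It is an illustration only: `check_pipeline` and the `downstream` stage are hypothetical and not part of burdoc; only the generated field names are taken from this page.

```python
from typing import List, Set, Tuple

# Fields listed under "Generates" for PDFLoadProcessor on this page.
PDF_LOAD_GENERATES = [
    "page_bounds", "text_elements", "image_elements",
    "drawing_elements", "images", "page_images",
]

# Each stage is (name, required fields, generated fields).
Stage = Tuple[str, List[str], List[str]]


def check_pipeline(stages: List[Stage]) -> List[str]:
    """Return one error string per stage whose required fields are not yet available."""
    available: Set[str] = set()
    errors: List[str] = []
    for name, required, generated in stages:
        missing = [field for field in required if field not in available]
        if missing:
            errors.append(f"{name} is missing {missing}")
        # Fields generated by this stage become available to later stages.
        available.update(generated)
    return errors


stages = [
    ("pdf_load", [], PDF_LOAD_GENERATES),  # requires nothing
    ("downstream", ["text_elements", "page_bounds"], ["layout"]),  # hypothetical stage
]
print(check_pipeline(stages))
```

Because the load processor requires nothing, it can sit first in any such ordering; a stage placed before it that needs `text_elements` would be reported as an error.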
- __init__(log_level: int = 20, ignore_images: bool = False)
Creates a PDF Load Processor
- Parameters:
log_level (int, optional) – Log level. Defaults to logging.INFO.
ignore_images (bool, optional) – Ignore images. This greatly increases speed but is likely to cause issues if images are used for layout purposes, such as section backgrounds or section breaks. Defaults to False.
- add_generated_items_to_fig(page_number: int, fig: Figure, data: Dict[str, Any])
Draw any items generated by this processor to a page image
- generates() List[str]
Return list of fields added by this processor
- merge_bullets_into_text(bullets: List[DrawingElement], text: List[LineElement])
Merge lone bullet points found as drawings into their closest text lines.
- Parameters:
bullets (List[DrawingElement]) – Lone bullet points found as drawings
text (List[LineElement]) – Text lines to merge the bullets into
- requirements() Tuple[List[str], List[str]]
Return list of required data fields and list of optional data fields
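As a rough illustration of what merge_bullets_into_text is described as doing (attaching each lone bullet drawing to its closest text line), here is a minimal sketch. It does not reproduce the burdoc implementation. `Bullet` and `Line` are hypothetical stand-ins for DrawingElement and LineElement, and the rules used here (closest by vertical centre, line must start to the right of the bullet, a `max_dy` cut-off, a "•" prefix) are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bullet:  # hypothetical stand-in for DrawingElement
    x: float   # left edge
    y: float   # vertical centre


@dataclass
class Line:    # hypothetical stand-in for LineElement
    x: float   # left edge
    y: float   # vertical centre
    text: str


def merge_bullets(bullets: List[Bullet], lines: List[Line], max_dy: float = 5.0) -> None:
    """Prefix the vertically closest line to the right of each bullet with a marker."""
    for bullet in bullets:
        # Assumed rule: a candidate line starts at or right of the bullet
        # and its centre lies within max_dy of the bullet's centre.
        candidates = [
            line for line in lines
            if line.x >= bullet.x and abs(line.y - bullet.y) <= max_dy
        ]
        if not candidates:
            continue  # unmatched bullets are left alone
        closest = min(candidates, key=lambda line: abs(line.y - bullet.y))
        closest.text = "\u2022 " + closest.text


lines = [
    Line(50, 100, "First item"),
    Line(50, 120, "Second item"),
    Line(50, 140, "Plain text"),
]
merge_bullets([Bullet(40, 101), Bullet(40, 119)], lines)
print([line.text for line in lines])
```

The sketch mutates the lines in place, matching the signature above, which takes both lists and documents no return value.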
PDF Content Handlers
- class burdoc.processors.pdf_load_processor.drawing_handler.DrawingHandler(pdf: Document, log_level: int = 20)
Extracts drawings from a PDF and applies standardisation and basic type inference
- get_page_drawings(page: Page, page_colour: ndarray) Dict[DrawingType, List[DrawingElement]]
Extract all drawings from the page and apply basic classification
- Parameters:
page (fitz.Page) – The page to extract drawings from
page_colour (np.ndarray) – The primary background colour of the page
- Returns:
Drawings found, separated by type
- Return type:
Dict[DrawingType, List[DrawingElement]]
- class burdoc.processors.pdf_load_processor.image_handler.ImageHandler(pdf: Document, log_level: int = 20)
Extracts images from a PDF, applies common preprocessing such as merging smasks and correcting inverted storage formats, then classifies them according to their purpose within the document.
- get_image_elements(page: Page, page_image: Image, page_colour: ndarray) Tuple[Dict[ImageType, List[ImageElement]], List[Image]]
Extracts images from a PDF page.
- Parameters:
page (fitz.Page) – PDF Page to extract from
page_image (Image.Image) – An image of the page, used to identify the role of each image
page_colour (np.ndarray) – The primary background colour of the page
- Returns:
Image elements found, separated by type, together with a list of images
- Return type:
Tuple[Dict[ImageType, List[ImageElement]], List[Image.Image]]
- class burdoc.processors.pdf_load_processor.text_handler.TextHandler(pdf: Document, log_level: int = 20)
Extracts text lines from a PDF then applies standardisation and filtering to them
- get_page_text(page: Page) List[LineElement]
Returns a cleaned, standardised set of LineElements from a PDF page. Currently applies:
  - Duplicate detection
  - Bullet point detection and assignment
  - Large starting character detection and assignment
- Parameters:
page (fitz.Page) –
- Returns:
List[LineElement]
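To make the duplicate detection step concrete, the following is a minimal sketch under stated assumptions. It treats two lines as duplicates when they carry the same text at nearly the same position, as can happen when a PDF draws text twice. `Line`, `drop_duplicates`, and the `tolerance` parameter are hypothetical; the criteria TextHandler actually uses are not described on this page.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Line:  # hypothetical stand-in for LineElement
    x: float
    y: float
    text: str


def drop_duplicates(lines: List[Line], tolerance: float = 1.0) -> List[Line]:
    """Keep the first of any lines sharing text and a quantised position."""
    seen: Set[Tuple[str, int, int]] = set()
    unique: List[Line] = []
    for line in lines:
        # Snap coordinates to a grid of size `tolerance`. This is cheap but
        # approximate: two close points on opposite sides of a grid boundary
        # will not be matched.
        key = (line.text, round(line.x / tolerance), round(line.y / tolerance))
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


page_lines = [
    Line(10.0, 20.0, "Title"),
    Line(10.2, 20.1, "Title"),  # overdrawn copy of the first line
    Line(10.0, 40.0, "Title"),  # same text further down the page, kept
]
unique_lines = drop_duplicates(page_lines)
print(len(unique_lines))
```

Keying on both text and position keeps legitimately repeated text, such as a heading that appears twice on a page, while removing overdrawn copies.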