PDFLoadProcessor

Individual loaders for elements used as part of the PDF processor

PDFLoadProcessor 

class burdoc.processors.pdf_load_processor.pdf_load_processor.PDFLoadProcessor(log_level: int = 20, ignore_images: bool = False)

Loads PDF from file and extracts essential information with minor processing/cleaning applied

Requires: None
Generates: ['page_bounds', 'text_elements', 'image_elements', 'drawing_elements', 'images', 'page_images']

__init__(log_level: int = 20, ignore_images: bool = False)

Creates a PDF Load Processor

Parameters:

log_level (int, optional) – Log level. Defaults to logging.INFO.
ignore_images (bool, optional) – Ignore images. This will greatly increase the speed but will likely cause issues if images are used for layout purposes, such as section backgrounds or section breaks. Defaults to False.

add_generated_items_to_fig(page_number: int, fig: Figure, data: Dict[str, Any]): Draw any items generated by this processor to a page image

generates() → List[str]: Return list of fields added by this processor

merge_bullets_into_text(bullets: List[DrawingElement], text: List[LineElement])

Merge lone bullet points found as drawings into their closest text lines.

Parameters:

bullets (List[DrawingElement]) –
text (List[LineElement]) –
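
As an illustration of what "merging into the closest text line" can mean, here is a minimal sketch. It is not Burdoc's implementation: `Box`, `Line`, `merge_bullets` and the `max_dy` threshold are hypothetical stand-ins, and the distance measure (vertical centre distance) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical bounding box; not Burdoc's own type."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def y_centre(self) -> float:
        return (self.y0 + self.y1) / 2


@dataclass
class Line:
    """Hypothetical stand-in for a LineElement."""
    box: Box
    text: str


def merge_bullets(bullets: List[Box], lines: List[Line], max_dy: float = 5.0) -> List[Line]:
    """Prefix the vertically closest line to the right of each bullet with a marker."""
    for bullet in bullets:
        # Only consider lines that start at or beyond the bullet's right edge.
        candidates = [ln for ln in lines if ln.box.x0 >= bullet.x1]
        if not candidates:
            continue
        closest = min(candidates, key=lambda ln: abs(ln.box.y_centre - bullet.y_centre))
        # Assumed threshold: ignore bullets that are not level with any line.
        if abs(closest.box.y_centre - bullet.y_centre) <= max_dy:
            closest.text = "\u2022 " + closest.text
    return lines
```

In this sketch a bullet that is not roughly level with any line is left unmerged rather than attached to a distant line.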

requirements() → Tuple[List[str], List[str]]: Return list of required data fields and list of optional data fields
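
The requirements()/generates() pair can be read as a small contract between processors. The sketch below shows one way a caller might check that contract. `ToyLoader`, `NeedsText` and `can_run` are hypothetical names for illustration only and are not part of Burdoc.

```python
from typing import Any, Dict, List, Tuple


class ToyLoader:
    """Hypothetical processor that needs no upstream data, like a loader."""

    def requirements(self) -> Tuple[List[str], List[str]]:
        # (required fields, optional fields)
        return ([], [])

    def generates(self) -> List[str]:
        return ["page_bounds", "text_elements"]


class NeedsText(ToyLoader):
    """Hypothetical downstream processor that requires 'text_elements'."""

    def requirements(self) -> Tuple[List[str], List[str]]:
        return (["text_elements"], [])

    def generates(self) -> List[str]:
        return ["headings"]


def can_run(processor: ToyLoader, data: Dict[str, Any]) -> bool:
    """Return True when every required field is already present in data."""
    required, _optional = processor.requirements()
    return all(field in data for field in required)
```

Under this reading, a processor with an empty required list (as "Requires: None" suggests for PDFLoadProcessor) can always run first, and the fields it generates make downstream processors runnable.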

PDF Content Handlers 

class burdoc.processors.pdf_load_processor.drawing_handler.DrawingHandler(pdf: Document, log_level: int = 20)

Extracts drawings from a PDF and applies standardisation and basic type inference

get_page_drawings(page: Page, page_colour: ndarray) → Dict[DrawingType, List[DrawingElement]]

Extract all drawings from the page and apply basic classification

Parameters:

page (fitz.Page) – The page to extract drawings from
page_colour (np.ndarray) – The primary background colour of the page

Returns:

Drawings found, separated by type

Return type:

Dict[DrawingType, List[DrawingElement]]

class burdoc.processors.pdf_load_processor.image_handler.ImageHandler(pdf: Document, log_level: int = 20)

Extracts images from a PDF, applies common preprocessing such as merging smasks and correcting inverted storage formats, then classifies them according to their purpose within the document.

get_image_elements(page: Page, page_image: Image, page_colour: ndarray) → Tuple[Dict[ImageType, List[ImageElement]], List[Image]]

Extracts images from a PDF page.

Parameters:

page (fitz.Page) – PDF Page to extract from
page_image (Image.Image) – An image of the page, used to identify the role of each image
page_colour (np.ndarray) – The primary background colour of the page

Returns:

A dictionary of image elements keyed by image type, and a list of images

Return type:

Tuple[Dict[ImageType, List[ImageElement]], List[Image.Image]]

class burdoc.processors.pdf_load_processor.text_handler.TextHandler(pdf: Document, log_level: int = 20)

Extracts text lines from a PDF then applies standardisation and filtering to them

get_page_text(page: Page) → List[LineElement]

Returns cleaned, standardised set of LineElements from a PDF page. Currently applies:

- Duplicate detection
- Bullet Point detection and assignment
- Large starting character detection and assignment

Parameters:

page (fitz.Page) –

Return type:

List[LineElement]
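
To make the duplicate detection step concrete, here is a minimal sketch of one possible approach: treating two lines as duplicates when they share the same text and near-identical bounding boxes. The `SimpleLine` tuple, the `drop_duplicate_lines` name and the `tolerance` value are assumptions for illustration; the criteria Burdoc actually uses are not described here.

```python
from typing import List, Tuple

# Hypothetical simplified line: (text, (x0, y0, x1, y1)). Not Burdoc's LineElement.
SimpleLine = Tuple[str, Tuple[float, float, float, float]]


def drop_duplicate_lines(lines: List[SimpleLine], tolerance: float = 1.0) -> List[SimpleLine]:
    """Keep the first of any lines sharing text and near-identical bounding boxes."""
    kept: List[SimpleLine] = []
    for text, box in lines:
        is_duplicate = any(
            text == kept_text
            and all(abs(a - b) <= tolerance for a, b in zip(box, kept_box))
            for kept_text, kept_box in kept
        )
        if not is_duplicate:
            kept.append((text, box))
    return kept
```

Repeated text at a different position, such as a heading that recurs further down the page, is kept because its bounding box differs by more than the tolerance.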
