Table Processors

Rules-based Processors

class burdoc.processors.table_processors.rules_table_processor.RulesTableProcessor(log_level: int = 20)

Applies a simple rules-based algorithm to identify tables in text. It looks for patterns in text blocks and makes no use of lines or images. It is particularly good at pulling out dense inline tables that the ML algorithms miss.

Requires: ['page_bounds', 'elements']
Optional: []
Generates: ['tables', 'elements']

add_generated_items_to_fig(page_number: int, fig: Figure, data: Dict[str, Any])

Draw any items generated by this processor to a page image

generates() → List[str]

Return list of fields added by this processor

requirements() → Tuple[List[str], List[str]]

Return list of required data fields and list of optional data fields
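The requirements() and generates() pair reads as a simple data contract: a processor declares which fields it needs and which fields it adds. The sketch below shows one way a caller might check that contract before running a processor, assuming the page data is held in a plain dict. `StubRulesProcessor` and `missing_fields` are hypothetical stand-ins written for this example and are not part of burdoc.

```python
from typing import Any, Dict, List, Tuple


class StubRulesProcessor:
    """Hypothetical stand-in that mirrors the documented field lists."""

    def requirements(self) -> Tuple[List[str], List[str]]:
        # (required fields, optional fields)
        return (["page_bounds", "elements"], [])

    def generates(self) -> List[str]:
        return ["tables", "elements"]


def missing_fields(processor: Any, data: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent from data."""
    required, _optional = processor.requirements()
    return [name for name in required if name not in data]


processor = StubRulesProcessor()
print(missing_fields(processor, {"page_bounds": {}, "elements": {}}))  # []
print(missing_fields(processor, {"elements": {}}))  # ['page_bounds']
```

Optional fields are deliberately ignored by the check, since their absence should not stop a processor from running.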

ML-based Processors

class burdoc.processors.table_processors.ml_table_processor.MLTableProcessor(strategy: Strategies = Strategies.DETR, log_level: int = 20)

Wrapper for ML models that detect tables. It is kept separate from the rules-based processor because it can only be run single-threaded.

Requires: ['text_elements'], plus any additional requirements from the specific strategy
Optional: []
Generates: ['tables', 'text_elements']

enum Strategies(value)

List of possible ML table finding strategies

Currently implemented:

* DETR: uses Microsoft Table Transformers

Valid values are as follows:

DETR = <Strategies.DETR: 1>

add_generated_items_to_fig(page_number: int, fig: Figure, data: Dict[str, Any])

Draw any items generated by this processor to a page image

generates() → List[str]

Return list of fields added by this processor

initialise()

Perform any expensive operations required to create a processor
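A separate initialise() step usually means construction stays cheap and the costly work is deferred until the processor is actually needed. The sketch below illustrates that general pattern only; `LazyProcessor`, its `model` attribute and its `process` method are hypothetical, and what MLTableProcessor really does inside initialise() is not described here.

```python
from typing import Any, Optional


class LazyProcessor:
    """Hypothetical processor whose constructor stays cheap."""

    def __init__(self) -> None:
        # Nothing expensive happens at construction time.
        self.model: Optional[Any] = None

    def initialise(self) -> None:
        # Stand-in for an expensive step such as loading model weights.
        if self.model is None:
            self.model = {"loaded": True}

    def process(self, page: int) -> str:
        if self.model is None:
            raise RuntimeError("call initialise() before process()")
        return f"processed page {page}"


processor = LazyProcessor()
processor.initialise()
print(processor.process(1))  # processed page 1
```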

requirements() → Tuple[List[str], List[str]]

Return list of required data fields and list of optional data fields

class burdoc.processors.table_processors.detr_table_strategy.DetrTableStrategy(log_level: int = 20)

Uses Microsoft's table-transformer to identify tables.

extract_tables(page_numbers: List[int], page_images: Dict[int, Image]) → Dict[int, List[List[Tuple[TableParts, Bbox]]]]

Identifies tables within a page image and, for each table, returns a list of table parts. If a GPU is used, pages are batched together to improve efficiency.

Returns:

{
    page_index (int): [
        [(TableParts, Bbox) for a table] for each table
    ]
}

static requirements() → List[str]

Return list of data requirements for this strategy
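The nested value returned by extract_tables can be hard to picture from the type signature alone. The sketch below walks a hand-built value of the same shape. The part names, the box layout `(x0, y0, x1, y1)` and the helper `count_parts` are assumptions made for illustration; the real `TableParts` and `Bbox` types are defined in burdoc.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for burdoc's TableParts and Bbox types.
TablePart = str
Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)

Tables = Dict[int, List[List[Tuple[TablePart, Box]]]]

result: Tables = {
    0: [
        [("table", (10, 10, 200, 120)), ("row", (10, 10, 200, 40))],
        [("table", (10, 300, 200, 380))],
    ],
    3: [],  # a page with no detected tables
}


def count_parts(tables_by_page: Tables) -> Dict[int, List[int]]:
    """Return {page_index: [number of parts in each table on that page]}."""
    return {
        page: [len(parts) for parts in tables]
        for page, tables in tables_by_page.items()
    }


print(count_parts(result))  # {0: [2, 1], 3: []}
```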

Strategy Interface Class

class burdoc.processors.table_processors.table_extractor_strategy.TableExtractorStrategy(name: str, log_level: int = 20)

Abstract base class defining the interface for table extraction methods. The interface is the same for ML and rules-based methods.

abstract extract_tables(page_numbers: List[int], page_images: Dict[int, Image]) → Dict[int, List[List[Tuple[TableParts, Bbox]]]]

Extracts tables and returns them as a nested dictionary keyed by page, where each table is a list of (TableParts, Bbox) tuples.

abstract static requirements() → List[str]

Return list of data requirements for this strategy
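To make the shape of the interface concrete, the sketch below mirrors the two abstract methods with Python's `abc` module and adds a trivial subclass. `SketchStrategy` and `EmptyStrategy` are hypothetical and written only for this example, and the `"page_images"` requirement is an assumed field name. The real base class may differ in details not shown in this reference.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class SketchStrategy(ABC):
    """Hypothetical mirror of the documented strategy interface."""

    def __init__(self, name: str, log_level: int = 20) -> None:
        self.name = name
        self.log_level = log_level

    @staticmethod
    @abstractmethod
    def requirements() -> List[str]:
        """Return the data fields this strategy needs."""

    @abstractmethod
    def extract_tables(
        self, page_numbers: List[int], page_images: Dict[int, Any]
    ) -> Dict[int, List[List[Tuple[Any, Any]]]]:
        """Return, per page, a list of tables as lists of (part, bbox) tuples."""


class EmptyStrategy(SketchStrategy):
    """Trivial strategy that never finds a table."""

    def __init__(self, log_level: int = 20) -> None:
        super().__init__("empty", log_level)

    @staticmethod
    def requirements() -> List[str]:
        return ["page_images"]  # assumed field name

    def extract_tables(self, page_numbers, page_images):
        return {page: [] for page in page_numbers}


strategy = EmptyStrategy()
print(strategy.extract_tables([0, 1], {}))  # {0: [], 1: []}
```

Because both methods are abstract, instantiating `SketchStrategy` directly raises `TypeError`, so a subclass cannot silently omit either one.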