Table Processors
Rules-based Processors
- class burdoc.processors.table_processors.rules_table_processor.RulesTableProcessor(log_level: int = 20)
Applies a simple rules-based algorithm to identify tables in text. It looks for patterns in text blocks and makes no use of lines or images. It is particularly good at extracting dense inline tables that the ML algorithms miss.
Requires: ['page_bounds', 'elements']
Optional: []
Generates: ['tables', 'elements']
- add_generated_items_to_fig(page_number: int, fig: Figure, data: Dict[str, Any])
Draw any items generated by this processor to a page image
- generates() → List[str]
Return list of fields added by this processor
- requirements() → Tuple[List[str], List[str]]
Return list of required data fields and list of optional data fields
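The (required, optional) tuple returned by requirements() lends itself to a pre-flight check before a processor is run. The following is a minimal sketch of such a check, not burdoc's own code: the helper missing_fields and the data dictionary are hypothetical, and only the field names are taken from the description above.

```python
from typing import Any, Dict, List, Tuple

# Field names copied from the RulesTableProcessor description above.
REQUIRED: List[str] = ["page_bounds", "elements"]
OPTIONAL: List[str] = []


def missing_fields(requirements: Tuple[List[str], List[str]],
                   data: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent from ``data``.

    ``requirements`` has the shape returned by ``requirements()``:
    (required_fields, optional_fields). Optional fields are never reported.
    (Hypothetical helper, not part of burdoc.)
    """
    required, _optional = requirements
    return [field for field in required if field not in data]


# Hypothetical pipeline state: only 'elements' has been produced so far.
data: Dict[str, Any] = {"elements": {}}
print(missing_fields((REQUIRED, OPTIONAL), data))
```

With the state shown, the check reports that 'page_bounds' still has to be generated by an earlier step.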
ML-based Processors
- class burdoc.processors.table_processors.ml_table_processor.MLTableProcessor(strategy: Strategies = Strategies.DETR, log_level: int = 20)
Wrapper for ML models that detect tables. It is kept separate from the rules-based processor because it can only be run single-threaded.
Requires: ['text_elements'], plus any additional requirements of the chosen strategy
Optional: []
Generates: ['tables', 'text_elements']
- enum Strategies(value)
List of possible ML table finding strategies
Currently implemented:
* DETR: DETR, using Microsoft Table Transformers
Valid values are as follows:
- DETR = <Strategies.DETR: 1>
- add_generated_items_to_fig(page_number: int, fig: Figure, data: Dict[str, Any])
Draw any items generated by this processor to a page image
- generates() → List[str]
Return list of fields added by this processor
- initialise()
Perform any expensive operations required to create a processor
- requirements() → Tuple[List[str], List[str]]
Return list of required data fields and list of optional data fields
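Separating the constructor from initialise() means that creating a processor stays cheap while the expensive work is deferred. The sketch below illustrates that pattern with a hypothetical stand-in class; SketchProcessor and its placeholder model are assumptions made for illustration and do not reproduce what MLTableProcessor actually loads.

```python
from typing import List, Optional, Tuple


class SketchProcessor:
    """Hypothetical stand-in; not the burdoc MLTableProcessor."""

    def __init__(self, log_level: int = 20) -> None:
        # Construction stays cheap: nothing heavy is loaded here.
        self.log_level = log_level
        self.model: Optional[str] = None

    def initialise(self) -> None:
        # Placeholder for an expensive step such as loading model weights.
        self.model = "loaded-model-placeholder"

    def requirements(self) -> Tuple[List[str], List[str]]:
        # (required, optional), using the field named in the text above.
        return (["text_elements"], [])

    def generates(self) -> List[str]:
        return ["tables", "text_elements"]


processor = SketchProcessor()
print(processor.model is None)   # nothing has been loaded yet
processor.initialise()           # the expensive step happens only here
print(processor.requirements(), processor.generates())
```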
- class burdoc.processors.table_processors.detr_table_strategy.DetrTableStrategy(log_level: int = 20)
Uses Microsoft's table-transformer models to identify tables:
microsoft/table-transformer-detection is used for finding tables
microsoft/table-transformer-structure-recognition is used for identifying table parts
- extract_tables(page_numbers: List[int], page_images: Dict[int, Image]) → Dict[int, List[List[Tuple[TableParts, Bbox]]]]
Identifies tables within each page image and, for each table, returns a list of table parts. If a GPU is used, pages are batched together to improve efficiency.
Returns:
{ page_index (int): [ [(TableParts, Bbox) for a table] for each table ] }
- static requirements() → List[str]
Return list of data requirements for this strategy
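The nested return value of extract_tables (page index, then tables, then (TableParts, Bbox) pairs) can be walked with three loops. The following sketch shows one way to summarise such a result. The TableParts members and the tuple-based Bbox used here are hypothetical stand-ins, since the real types are not described in this section, and count_parts is an illustrative helper rather than a burdoc function.

```python
from enum import Enum, auto
from typing import Dict, List, Tuple


class TableParts(Enum):
    # Hypothetical members; the real enum's values are not listed above.
    TABLE = auto()
    ROW = auto()
    COLUMN = auto()


# Hypothetical stand-in for Bbox: (x0, y0, x1, y1).
Bbox = Tuple[float, float, float, float]
PageTables = Dict[int, List[List[Tuple[TableParts, Bbox]]]]


def count_parts(result: PageTables) -> Dict[int, List[Dict[str, int]]]:
    """For each page, return one {part_name: count} dict per table."""
    summary: Dict[int, List[Dict[str, int]]] = {}
    for page_index, tables in result.items():
        page_summary: List[Dict[str, int]] = []
        for table in tables:
            counts: Dict[str, int] = {}
            for part, _bbox in table:
                counts[part.name] = counts.get(part.name, 0) + 1
            page_summary.append(counts)
        summary[page_index] = page_summary
    return summary


# Made-up data in the documented shape: one table on page 0, none on page 1.
example: PageTables = {
    0: [[
        (TableParts.TABLE, (10.0, 10.0, 200.0, 100.0)),
        (TableParts.ROW, (10.0, 10.0, 200.0, 40.0)),
        (TableParts.ROW, (10.0, 40.0, 200.0, 100.0)),
        (TableParts.COLUMN, (10.0, 10.0, 100.0, 100.0)),
    ]],
    1: [],
}
print(count_parts(example))
```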
Strategy Interface Class
- class burdoc.processors.table_processors.table_extractor_strategy.TableExtractorStrategy(name: str, log_level: int = 20)
Abstract base class defining the interface for table extraction methods. The interface is the same for ML and rules-based methods.
- abstract extract_tables(page_numbers: List[int], page_images: Dict[int, Image]) → Dict[int, List[List[Tuple[TableParts, Bbox]]]]
Extracts tables and returns them as a nested dictionary keyed by page index, with one list of (TableParts, Bbox) tuples per table found on that page
- abstract static requirements() → List[str]
Return list of data requirements for this strategy
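A new strategy has to supply both abstract members. The sketch below mirrors the interface with a hypothetical stand-in base class (SketchStrategy) so that it runs without burdoc installed. EmptyStrategy, the 'page_images' requirement name, and the use of Any in place of Image, TableParts and Bbox are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple

# Any stands in for the Image, TableParts and Bbox types.
PageTables = Dict[int, List[List[Tuple[Any, Any]]]]


class SketchStrategy(ABC):
    """Hypothetical stand-in mirroring the interface described above."""

    def __init__(self, name: str, log_level: int = 20) -> None:
        self.name = name
        self.log_level = log_level

    @staticmethod
    @abstractmethod
    def requirements() -> List[str]:
        """Return list of data requirements for this strategy."""

    @abstractmethod
    def extract_tables(self, page_numbers: List[int],
                       page_images: Dict[int, Any]) -> PageTables:
        """Return {page_index: [[(part, bbox), ...] per table]}."""


class EmptyStrategy(SketchStrategy):
    """Illustrative strategy that never finds a table."""

    def __init__(self, log_level: int = 20) -> None:
        super().__init__("empty", log_level)

    @staticmethod
    def requirements() -> List[str]:
        return ["page_images"]  # hypothetical field name

    def extract_tables(self, page_numbers: List[int],
                       page_images: Dict[int, Any]) -> PageTables:
        # One entry per requested page, each with an empty list of tables.
        return {page: [] for page in page_numbers}


strategy = EmptyStrategy()
print(strategy.name, strategy.requirements())
print(strategy.extract_tables([0, 2], {}))
```

Because both members are declared abstract, Python refuses to instantiate the base class directly or any subclass that leaves one of them out.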