Utils

Common functions and utilities used across the burdoc package

Output Comparison

A comparison function for generating a smart diff from extracted content

burdoc.utils.compare.compare(obj1: Dict[str, Any], obj2: Dict[str, Any], ignore_paths: List[str] | None = None) → List[Dict[str, Any]]

Compares two JSON objects generated by burdoc and returns a list of changes. Unlike most dictionary comparison systems, this detects re-orderings as well as changes.

Parameters:
  • obj1 (Dict[str, Any]) – A JSON output from Burdoc.

  • obj2 (Dict[str, Any]) – A JSON output from Burdoc.

  • ignore_paths (List[str], optional) – Paths for which changes are ignored. Use this to exclude unstable fields such as file paths. Defaults to None.

Returns:

A list of changes in the following format, where 'type' is one of change, addition, deletion or reorder:

[
    {
        'path':path to change in object,
        'type':[change, addition, deletion, reorder],
        'old':old value,
        'new':new value,
        'value':value of the object (only used for reorder)
    }
]

Return type:

List[Dict[str, Any]]
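
The following is a minimal sketch of how the returned list might be consumed, assuming only the change format documented above. The helper summarise_changes is hypothetical (it is not part of burdoc), and the sample list, including its path strings, is hand-written for illustration and is not real Burdoc output.

```python
from collections import Counter
from typing import Any, Dict, List


def summarise_changes(changes: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count changes by their 'type' field (hypothetical helper, not part of burdoc)."""
    return dict(Counter(change["type"] for change in changes))


# Hand-written sample in the documented format; the paths and values are
# illustrative assumptions, not real Burdoc output.
sample_changes = [
    {"path": "metadata.title", "type": "change", "old": "Draft", "new": "Final"},
    {"path": "content.0", "type": "reorder", "value": "Introduction"},
    {"path": "content.3", "type": "addition", "new": "Appendix"},
]

summary = summarise_changes(sample_changes)
```

With burdoc installed, the list would instead come from a call such as compare(obj1, obj2, ignore_paths=[...]), using the signature shown above.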

Image Manipulation

Utility functions for manipulating/analysing images

burdoc.utils.image_manip.get_image_palette(image: Image, n_colours: int, n_means: int = 5) → List[Tuple[List[float], Any]]

Get the top n most representative colours from an image

This blurs the image to remove noise, then performs a K-means clustering over pixel values.

Parameters:
  • image (Image) – A PIL Image

  • n_colours (int) – Number of colours to extract

  • n_means (int, optional) – Number of means to use. Increasing this gives more accurate results in busy images but less accurate results in images with only a small number of colours. Defaults to 5.

Returns:

Tuples pairing each extracted colour (a list of channel values) with the percentage of pixels close to that colour.

Return type:

List[Tuple[List[float], Any]]
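
To illustrate the clustering step described above, here is a pure-Python K-means sketch over a list of RGB tuples. It is not burdoc's implementation: it skips the blur, takes raw pixel tuples instead of a PIL Image, uses a deterministic initialisation chosen for the example, and assumes that the n_colours largest clusters are the ones returned. The name palette_sketch and the iterations parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

Colour = Tuple[float, float, float]


def _sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two colours."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def palette_sketch(
    pixels: List[Colour], n_colours: int, n_means: int = 5, iterations: int = 10
) -> List[Tuple[List[float], float]]:
    """Cluster pixels with K-means and return the n_colours largest clusters."""
    # Deterministic initialisation: the first n_means distinct pixel values.
    centres: List[List[float]] = []
    for pixel in pixels:
        if len(centres) >= n_means:
            break
        if list(pixel) not in centres:
            centres.append(list(pixel))

    clusters: List[List[Colour]] = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: attach every pixel to its nearest centre.
        clusters = [[] for _ in centres]
        for pixel in pixels:
            nearest = min(range(len(centres)), key=lambda i: _sq_dist(pixel, centres[i]))
            clusters[nearest].append(pixel)
        # Update step: move each centre to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centres[i] = [sum(channel) / len(members) for channel in zip(*members)]

    # Fraction of pixels assigned to each centre in the final assignment step.
    result = [(centres[i], len(members) / len(pixels)) for i, members in enumerate(clusters)]
    result.sort(key=lambda pair: pair[1], reverse=True)
    return result[:n_colours]


# Six red pixels and four blue ones, clustered with two means.
example_pixels = [(255, 0, 0)] * 6 + [(0, 0, 255)] * 4
palette = palette_sketch(example_pixels, n_colours=2, n_means=2)
```

Note that this sketch reports a fraction between 0 and 1 for each colour; the units of the value returned by get_image_palette should be checked against the function itself.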

Layout Graph

The LayoutGraph efficiently builds a modified adjacency graph of elements. Used by various parts of the processing pipeline.

class burdoc.utils.layout_graph.LayoutGraph(pagebound: Bbox, elements: Sequence[LayoutElement])

LayoutGraph attempts to efficiently build a modified adjacency graph over the passed elements.

Each node in the graph is labelled with all ‘adjacent’ nodes in each of the cardinal directions, ordered by edge-to-edge distance. Here adjacency means that no horizontal or vertical line drawn between opposing edges of the boxes intersects any other box.

[   a   ]    [   b   ]
[ c ]
[         d          ]

Under this diagram the adjacency relationships are (a,right,b), (a,down,c), (c,down,d), and (b,down,d) but not (a,down,d). Note that adjacency is symmetric, so (a,right,b) implies (b,left,a) and so on.
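
The adjacency rule can be made concrete with a brute-force sketch. This is one reading of the definition above, not LayoutGraph's actual (efficient) algorithm: two boxes are treated as adjacent in a direction when their extents overlap on the other axis and no third box intersects the gap between their facing edges over that overlap. The Box class, the coordinates chosen for the diagram, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a bounding box; y grows downwards."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def _overlap(lo1: float, hi1: float, lo2: float, hi2: float) -> Optional[Tuple[float, float]]:
    """Return the shared interval, or None if the intervals do not overlap."""
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    return (lo, hi) if lo < hi else None


def _blocked(region: Tuple[float, float, float, float], others: List[Box]) -> bool:
    """True if any other box intersects the region (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    return any(o.x0 < x1 and o.x1 > x0 and o.y0 < y1 and o.y1 > y0 for o in others)


def adjacency_sketch(boxes: List[Box]) -> Set[Tuple[str, str, str]]:
    """Brute-force adjacency: check every ordered pair against all other boxes."""
    relations: Set[Tuple[str, str, str]] = set()
    for a in boxes:
        for b in boxes:
            if a is b:
                continue
            others = [o for o in boxes if o is not a and o is not b]
            # 'down': b lies below a, they overlap horizontally, gap is clear.
            span = _overlap(a.x0, a.x1, b.x0, b.x1)
            if span and b.y0 >= a.y1 and not _blocked((span[0], a.y1, span[1], b.y0), others):
                relations.add((a.name, "down", b.name))
                relations.add((b.name, "up", a.name))  # adjacency is symmetric
            # 'right': b lies to the right of a, they overlap vertically, gap is clear.
            span = _overlap(a.y0, a.y1, b.y0, b.y1)
            if span and b.x0 >= a.x1 and not _blocked((a.x1, span[0], b.x0, span[1]), others):
                relations.add((a.name, "right", b.name))
                relations.add((b.name, "left", a.name))
    return relations


# Illustrative coordinates for the diagram above.
diagram = [
    Box("a", 0, 0, 9, 1),
    Box("b", 13, 0, 22, 1),
    Box("c", 0, 2, 5, 3),
    Box("d", 0, 4, 22, 5),
]
relations = adjacency_sketch(diagram)
```

On these coordinates the sketch yields the four relationships listed above plus their mirrored counterparts, and it excludes (a,down,d) because c sits in the gap between a and d.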

class Node(node_id: int, element: LayoutElement)

Container for a graph node, storing its adjacent nodes and the original element

down: List[Tuple[int, float]]

All down adjacent nodes, sorted closest to furthest

left: List[Tuple[int, float]]

All left adjacent nodes, sorted closest to furthest

right: List[Tuple[int, float]]

All right adjacent nodes, sorted closest to furthest

up: List[Tuple[int, float]]

All up adjacent nodes, sorted closest to furthest

__init__(pagebound: Bbox, elements: Sequence[LayoutElement])

Create a LayoutGraph from the passed elements.

Parameters:
  • pagebound (Bbox) – Bounding box of the containing page or section

  • elements (Sequence[LayoutElement]) – Sequence of elements to build the layout adjacency graph

get_node(id_or_id_dist_pair: int | Tuple[int, float]) → Node

Retrieves a graph node from its Id, or from an (Id, distance) tuple.

Parameters:
  • id_or_id_dist_pair (Union[int, Tuple[int, float]]) – Node Id or the (Id, distance) tuple used for storing node adjacencies

Raises:

IndexError – Id does not exist in the graph

Returns:

The requested node

Return type:

Node
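
The sketch below illustrates the documented contract of get_node using a hypothetical stand-in class, MiniGraph, which stores plain strings instead of Node objects. It assumes a node Id is a position in an internal list; how LayoutGraph actually stores nodes is not stated here.

```python
from typing import List, Tuple, Union


class MiniGraph:
    """Hypothetical stand-in that mimics the documented get_node behaviour."""

    def __init__(self, nodes: List[str]):
        # Assumption: a node Id is simply its position in this list.
        self.nodes = nodes

    def get_node(self, id_or_id_dist_pair: Union[int, Tuple[int, float]]) -> str:
        if isinstance(id_or_id_dist_pair, tuple):
            node_id = id_or_id_dist_pair[0]  # keep the Id, drop the distance
        else:
            node_id = id_or_id_dist_pair
        if not 0 <= node_id < len(self.nodes):
            raise IndexError(f"Node Id {node_id} does not exist in the graph")
        return self.nodes[node_id]


graph = MiniGraph(["a", "b", "c"])
by_id = graph.get_node(1)
by_pair = graph.get_node((1, 12.5))
```

Accepting both forms means that entries taken from a node's down, left, right or up lists can be passed to get_node without unpacking them first.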

node_has_ancestor(node_id: int, target_id: int) → bool

Check whether the target node is an ‘ancestor’ of the primary node. Here ‘ancestor’ means that there is a chain of leftwards or upwards adjacency relations leading from the node to the target.

Parameters:
  • node_id (int) – Starting node

  • target_id (int) – Node to check if in ancestry

Returns:

True if the target node is an ancestor of the starting node

Return type:

bool
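
As a rough illustration of the ancestor check, the sketch below walks ‘up’ and ‘left’ adjacency lists with a depth-first search. It is not LayoutGraph's implementation. It assumes that a chain may mix upward and leftward steps and that a node is not its own ancestor. The dictionary layout, the function name and the distances are hypothetical.

```python
from typing import Dict, List, Tuple

# node Id -> direction -> list of (neighbour Id, distance), closest first
Adjacency = Dict[int, Dict[str, List[Tuple[int, float]]]]


def has_ancestor_sketch(graph: Adjacency, node_id: int, target_id: int) -> bool:
    """Return True if target_id is reachable from node_id via up/left steps."""
    stack = [node_id]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for direction in ("up", "left"):
            for neighbour_id, _distance in graph[current].get(direction, []):
                if neighbour_id == target_id:
                    return True
                stack.append(neighbour_id)
    return False


# Up/left adjacencies for a four-box layout with a=0, b=1, c=2, d=3, where
# b is right of a, c is below a, and d is below both c and b.
# Distances are illustrative.
example_graph: Adjacency = {
    0: {"up": [], "left": []},
    1: {"up": [], "left": [(0, 4.0)]},
    2: {"up": [(0, 1.0)], "left": []},
    3: {"up": [(2, 1.0), (1, 3.0)], "left": []},
}

d_has_a = has_ancestor_sketch(example_graph, 3, 0)
a_has_d = has_ancestor_sketch(example_graph, 0, 3)
```

Here a is an ancestor of d through the chain d, c, a, while d is not an ancestor of a because a has no upward or leftward neighbours.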

Logging

Utility function for retrieving a tt_logger that can be used across threads

burdoc.utils.logging.get_logger(name: str, log_path: str = '.burdoc.log', log_level: int = 20)

Retrieve a threadsafe logger.

Parameters:
  • name (str) – Name of the logger

  • log_path (str, optional) – Write path for the log file. Defaults to “.burdoc.log”.

  • log_level (int, optional) – Log level. Defaults to logging.INFO.

Returns:

A thread-safe logger with the given name
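
For orientation, here is a sketch of a function with the same signature built only on the standard library. It is not burdoc's implementation, and how burdoc handles threading is not described here; the sketch relies on the fact that the standard logging handlers are already thread-safe. The name get_logger_sketch, the format string and the temporary log file are assumptions for the example. The default log_level of 20 corresponds to logging.INFO.

```python
import logging
import os
import tempfile


def get_logger_sketch(
    name: str, log_path: str = ".burdoc.log", log_level: int = logging.INFO
) -> logging.Logger:
    """Return a named logger that writes to log_path (illustrative only)."""
    logger = logging.getLogger(name)
    logger.setLevel(log_level)
    # Attach the file handler once, so repeated calls do not duplicate output.
    if not any(isinstance(h, logging.FileHandler) for h in logger.handlers):
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(threadName)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Write to a temporary file so the example leaves nothing in the working directory.
log_file = os.path.join(tempfile.mkdtemp(), "example.log")
logger = get_logger_sketch("example", log_path=log_file)
logger.info("processing page %d", 1)
```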

Render Pages

Utility functions for drawing a rendered page image and overlaying extracted elements

burdoc.utils.render_pages.add_rect_to_figure(fig: Figure, bbox: Bbox, colour: str)

Add a rectangle to the passed figure

Parameters:
  • fig (Figure) – A plotly figure

  • bbox (Bbox) – Bbox of rectangle to draw

  • colour (str) – Line colour

burdoc.utils.render_pages.add_text_to_figure(fig: Figure, point: Point, colour: str, text: str, text_size: float = 20)

Add text to the passed figure

Parameters:
  • fig (Figure) – A plotly figure

  • point (Point) – Top left coordinates of the text

  • colour (str) – Text colour

  • text (str) – Text to add

  • text_size (float, optional) – Size of the text. Defaults to 20.

burdoc.utils.render_pages.render_pages(data: Dict[str, Any], processors: List[Processor], pages: List[int] | None = None)

Render an image of each page to the screen and apply the draw functions of all passed processors

Parameters:
  • data (Dict[str, Any]) – Extracted content

  • processors (List[Processor]) – Processors used to overlay extraction elements

  • pages (Optional[List[int]], optional) – Pages to draw. Will draw all if None. Defaults to None.