Utils
Common functions and utilities used across the burdoc package
Output Comparison
A comparison function for generating a smart diff from extracted content
- burdoc.utils.compare.compare(obj1: Dict[str, Any], obj2: Dict[str, Any], ignore_paths: List[str] | None = None) → List[Dict[str, Any]]
Compares two JSON objects generated by burdoc and returns a list of changes. Unlike most dictionary comparison systems, this detects re-orderings as well as changes.
- Parameters:
obj1 (Dict[str, Any]) – A JSON output from Burdoc.
obj2 (Dict[str, Any]) – A JSON output from Burdoc.
ignore_paths (List[str], optional) – Paths at which changes are ignored. Use this to exclude unstable fields such as file paths. Defaults to None.
- Returns:
A list of changes in the format:
[ {
'path': path to change in object,
'type': one of [change, addition, deletion, reorder],
'old': old value,
'new': new value,
'value': value of the object (only used for reorder)
} ]
- Return type:
List[Dict[str, Any]]
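The shape of the returned change list can be illustrated with a small stand-in. This is a minimal sketch, not burdoc's implementation: `sketch_compare`, the dotted path format, and the choice to report the new ordering under 'value' are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def _same_items(first: List[Any], second: List[Any]) -> bool:
    """True if both lists hold the same items, ignoring order."""
    remaining = list(second)
    for item in first:
        if item not in remaining:
            return False
        remaining.remove(item)
    return not remaining


def sketch_compare(old: Any, new: Any, path: str = "",
                   ignore_paths: Optional[List[str]] = None) -> List[Dict[str, Any]]:
    """Hypothetical stand-in that emits records in the documented format."""
    ignore = ignore_paths or []
    changes: List[Dict[str, Any]] = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            sub = f"{path}.{key}" if path else key  # dotted paths are an assumption
            if sub in ignore:
                continue
            if key not in new:
                changes.append({"path": sub, "type": "deletion", "old": old[key]})
            elif key not in old:
                changes.append({"path": sub, "type": "addition", "new": new[key]})
            else:
                changes += sketch_compare(old[key], new[key], sub, ignore)
    elif (isinstance(old, list) and isinstance(new, list)
          and old != new and _same_items(old, new)):
        # Same items in a different order: report a reorder, not a change.
        changes.append({"path": path, "type": "reorder", "value": new})
    elif old != new:
        changes.append({"path": path, "type": "change", "old": old, "new": new})
    return changes


before = {"title": "Report", "tags": ["a", "b"], "path": "/tmp/x.pdf"}
after = {"title": "Report v2", "tags": ["b", "a"], "path": "/home/y.pdf", "pages": 3}
changes = sketch_compare(before, after, ignore_paths=["path"])
for change in changes:
    print(change)
```

Here the differing "path" field is skipped because it is listed in ignore_paths, the swapped "tags" list is reported as a reorder, and the remaining differences are reported as an addition and a change.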
Image Manipulation
Utility functions for manipulating/analysing images
- burdoc.utils.image_manip.get_image_palette(image: Image, n_colours: int, n_means: int = 5) → List[Tuple[List[float], Any]]
Get the top n most representative colours from an image
This blurs the image to remove noise, then performs K-means clustering over the pixel values.
- Parameters:
image (Image) – A PIL Image
n_colours (int) – Number of colours to extract
n_means (int, optional) – Number of means to use. Increasing this gives more accurate results in busy images but less accurate results in ones with only a small number of colours. Defaults to 5.
- Returns:
The extracted colours, each given as a triple of channel values and paired with the percent of pixels close to that colour.
- Return type:
List[Tuple[List[float], Any]]
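The clustering step can be sketched in plain Python. This is only an illustration under stated assumptions: `sketch_palette` is a hypothetical helper, it takes a list of RGB tuples rather than a PIL Image, it skips the blurring step, it uses a deterministic initialisation, and it reports percentages on a 0 to 100 scale. The real function may differ on all of these points.

```python
from typing import List, Sequence, Tuple

Colour = Tuple[float, float, float]


def nearest_centre(pixel: Colour, centres: List[List[float]]) -> int:
    """Index of the centre with the smallest squared distance to the pixel."""
    return min(
        range(len(centres)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(pixel, centres[i])),
    )


def sketch_palette(pixels: Sequence[Colour], n_colours: int,
                   n_means: int = 5, iterations: int = 10
                   ) -> List[Tuple[List[float], float]]:
    """Hypothetical K-means palette extraction over raw pixel values."""
    # Deterministic start: evenly spaced samples from the distinct colours.
    distinct = sorted(set(pixels))
    k = min(n_means, len(distinct))
    centres = [list(map(float, distinct[i * len(distinct) // k])) for i in range(k)]

    for _ in range(iterations):
        clusters: List[List[Colour]] = [[] for _ in range(k)]
        for pixel in pixels:
            clusters[nearest_centre(pixel, centres)].append(pixel)
        for i, members in enumerate(clusters):
            if members:  # move each centre to the mean of its members
                centres[i] = [sum(ch) / len(members) for ch in zip(*members)]

    # Final assignment gives the share of pixels close to each colour.
    counts = [0] * k
    for pixel in pixels:
        counts[nearest_centre(pixel, centres)] += 1
    palette = [(centres[i], 100.0 * counts[i] / len(pixels))
               for i in range(k) if counts[i]]
    palette.sort(key=lambda entry: entry[1], reverse=True)
    return palette[:n_colours]


# Ten pixels: mostly red, one slightly off-red, three blue.
pixels = [(255, 0, 0)] * 6 + [(250, 5, 5)] + [(0, 0, 255)] * 3
palette = sketch_palette(pixels, n_colours=2, n_means=2)
for colour, percent in palette:
    print([round(c, 1) for c in colour], percent)
```

With two means, the off-red pixel is absorbed into the red cluster, so the sketch reports a near-red colour covering 70 percent of the pixels and pure blue covering 30 percent.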
Layout Graph
The LayoutGraph efficiently builds a modified adjacency graph of elements. Used by various parts of the processing pipeline.
- class burdoc.utils.layout_graph.LayoutGraph(pagebound: Bbox, elements: Sequence[LayoutElement])
LayoutGraph attempts to efficiently build a modified adjacency graph over the passed elements.
Each node in the graph is labelled with all ‘adjacent’ nodes in each of the cardinal directions, ordered by edge-to-edge distance. Here adjacency means that no horizontal or vertical line drawn between opposing edges of the boxes intersects any other box.
[ a ] [ b ]
[ c ]
[    d    ]
Under this diagram the adjacency relationships are (a,right,b), (a,down,c), (c,down,d), and (b,down,d) but not (a,down,d). Note that adjacency is symmetric, so (a,right,b) implies (b,left,a) and so on.
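The adjacency rule can be checked with a brute-force sketch. This is not the efficient construction used by LayoutGraph. The box coordinates, the (x0, y0, x1, y1) tuple layout with y growing downwards, and every function name below are assumptions chosen to reproduce the relationships listed above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downwards


def overlap(lo1: float, hi1: float, lo2: float, hi2: float):
    """Overlap of two ranges as (lo, hi), or None if they do not overlap."""
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    return (lo, hi) if lo < hi else None


def down_distance(top: Box, bottom: Box, others: List[Box]) -> Optional[float]:
    """Edge-to-edge distance if `bottom` is down-adjacent to `top`, else None."""
    if bottom[1] < top[3]:
        return None  # `bottom` does not start below `top`
    span = overlap(top[0], top[2], bottom[0], bottom[2])
    if span is None:
        return None  # no vertical line connects the two opposing edges
    for other in others:
        in_span = overlap(span[0], span[1], other[0], other[2]) is not None
        in_gap = overlap(top[3], bottom[1], other[1], other[3]) is not None
        if in_span and in_gap:
            return None  # another box sits between the two edges
    return bottom[1] - top[3]


def transpose(box: Box) -> Box:
    """Swap the axes so that 'right' can reuse the 'down' check."""
    return (box[1], box[0], box[3], box[2])


def right_distance(left: Box, right: Box, others: List[Box]) -> Optional[float]:
    return down_distance(transpose(left), transpose(right),
                         [transpose(o) for o in others])


def adjacency(boxes: Dict[str, Box],
              dist_fn: Callable[[Box, Box, List[Box]], Optional[float]]
              ) -> Dict[str, List[Tuple[str, float]]]:
    """All adjacent boxes per box, sorted closest to furthest."""
    result: Dict[str, List[Tuple[str, float]]] = {}
    for src, src_box in boxes.items():
        found = []
        for dst, dst_box in boxes.items():
            if dst == src:
                continue
            others = [b for name, b in boxes.items() if name not in (src, dst)]
            dist = dist_fn(src_box, dst_box, others)
            if dist is not None:
                found.append((dst, dist))
        result[src] = sorted(found, key=lambda pair: pair[1])
    return result


boxes = {
    "a": (0, 0, 10, 10), "b": (20, 0, 30, 10),
    "c": (0, 20, 10, 30),
    "d": (0, 40, 30, 50),
}
down = adjacency(boxes, down_distance)
right = adjacency(boxes, right_distance)
print("down:", down)
print("right:", right)
```

In this layout d is not down-adjacent to a because c lies between them, while b reaches d directly. The left and up lists follow by symmetry.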
- class Node(node_id: int, element: LayoutElement)
Container for a graph node, storing its adjacent nodes and the original element
- down: List[Tuple[int, float]]
All down adjacent nodes, sorted closest to furthest
- left: List[Tuple[int, float]]
All left adjacent nodes, sorted closest to furthest
- right: List[Tuple[int, float]]
All right adjacent nodes, sorted closest to furthest
- up: List[Tuple[int, float]]
All up adjacent nodes, sorted closest to furthest
- __init__(pagebound: Bbox, elements: Sequence[LayoutElement])
Create a LayoutGraph from the passed elements.
- Parameters:
pagebound (Bbox) – Bounding box of the containing page or section
elements (Sequence[LayoutElement]) – Sequence of elements to build the layout adjacency graph
- get_node(id_or_id_dist_pair: int | Tuple[int, float]) → Node
Retrieves a graph node from its Id, or from an (Id, distance) tuple.
- Parameters:
id_or_id_dist_pair (Union[int, Tuple[int, float]]) – Node Id or the (Id, distance) tuple used for storing node adjacencies
- Raises:
IndexError – Id does not exist in the graph
- Returns:
The requested node
- Return type:
Node
- node_has_ancestor(node_id: int, target_id: int) → bool
Check whether the target node is an ‘ancestor’ of the primary node. Here ‘ancestor’ means that a chain of leftwards or upwards adjacency relations leads from the node to the target.
- Parameters:
node_id (int) – Starting node
target_id (int) – Node to check if in ancestry
- Returns:
True if the target node is an ancestor of the starting node
- Return type:
bool
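One way to picture the ancestor check is as a search that only follows left and up adjacencies. The following is a sketch under assumptions rather than the library's code: `has_ancestor` is hypothetical, the adjacency lists are plain dictionaries of (Id, distance) tuples, and a node is not counted as its own ancestor.

```python
from typing import Dict, List, Tuple

Adjacency = Dict[int, List[Tuple[int, float]]]


def has_ancestor(node_id: int, target_id: int,
                 left: Adjacency, up: Adjacency) -> bool:
    """Follow left/up adjacencies from node_id and report if target_id is reached."""
    stack = [node_id]
    seen = {node_id}
    while stack:
        current = stack.pop()
        for neighbour, _distance in left.get(current, []) + up.get(current, []):
            if neighbour == target_id:
                return True
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


# Four nodes where 1 is right of 0, 2 is below 0, and 3 is below both 2 and 1.
left: Adjacency = {1: [(0, 10.0)]}
up: Adjacency = {2: [(0, 10.0)], 3: [(2, 10.0), (1, 30.0)]}

print(has_ancestor(3, 0, left, up))  # 3 -> 2 -> 0 via upward adjacencies
print(has_ancestor(0, 3, left, up))  # node 0 has no left or up neighbours
print(has_ancestor(1, 2, left, up))  # node 1 only leads leftwards to node 0
```

The first call succeeds because node 0 is reachable through node 2, while the other two calls fail because no leftward or upward chain reaches the target.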
Logging
Utility function for retrieving a tt_logger that can manage across threads
- burdoc.utils.logging.get_logger(name: str, log_path: str = '.burdoc.log', log_level: int = 20)
Retrieve a threadsafe logger.
- Parameters:
name (str) – Name of the logger
log_path (str, optional) – Write path for the log file. Defaults to “.burdoc.log”.
log_level (int, optional) – Log level. Defaults to logging.INFO.
- Returns:
The threadsafe logger
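The parameters map naturally onto Python's standard logging module, where the level 20 is logging.INFO. The sketch below is a guess at the general shape only: `sketch_get_logger` and its log format are hypothetical, and it relies on the fact that standard library handlers are thread-safe rather than on anything burdoc does internally.

```python
import logging
import os
import tempfile


def sketch_get_logger(name: str, log_path: str = ".burdoc.log",
                      log_level: int = logging.INFO) -> logging.Logger:
    """Hypothetical stand-in: a named logger writing to a file at a given level."""
    logger = logging.getLogger(name)
    logger.setLevel(log_level)
    if not logger.handlers:  # avoid attaching a second handler on repeat calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(threadName)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


# Write to a temporary file so the example leaves the working directory clean.
log_file = os.path.join(tempfile.mkdtemp(), "example.log")
logger = sketch_get_logger("example", log_path=log_file)
logger.info("page %d processed", 1)
logger.debug("not written at the default INFO level")

with open(log_file) as handle:
    content = handle.read()
print(content)
```

Only the INFO message reaches the file, since DEBUG (10) is below the default level of 20.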
Render Pages
Utility functions for drawing a rendered page image and overlaying extracted elements
- burdoc.utils.render_pages.add_rect_to_figure(fig: Figure, bbox: Bbox, colour: str)
Add a rectangle to the passed figure
- Parameters:
fig (Figure) – A plotly figure
bbox (Bbox) – Bbox of rectangle to draw
colour (str) – Line colour
- burdoc.utils.render_pages.add_text_to_figure(fig: Figure, point: Point, colour: str, text: str, text_size: float = 20)
Add text to the passed figure
- Parameters:
fig (Figure) – A plotly figure
point (Point) – Top left coordinates of the text
colour (str) – Text colour
text (str) – Text to add
text_size (float, optional) – Size of the text. Defaults to 20.
- burdoc.utils.render_pages.render_pages(data: Dict[str, Any], processors: List[Processor], pages: List[int] | None = None)
Render an image of the page to the screen and apply the draw functions of all passed processors
- Parameters:
data (Dict[str, Any]) – Extracted content
processors (List[Processor]) – Processors used to overlay extraction elements
pages (Optional[List[int]], optional) – Pages to draw. Will draw all if None. Defaults to None.