Elements
Elements are (mostly) objects that have a physical location within a PDF page.
Elements generally inherit from LayoutElement to provide a consistent interface for accessing location information.
LayoutElement
- class burdoc.elements.element.LayoutElement(bbox: Bbox, title: str = 'LayoutElement')
Base class for any layout object within the PDF. LayoutElements can be used to describe anything that has a bbox.
- to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False) Dict[str, Any]
Convert the object into a JSON object
Example JSON:
{ "name": "LayoutElement", "bbox": {...} [optional] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the object.
- Return type:
Dict[str, Any]
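The shape of the to_json output can be sketched without Burdoc installed. The block below is a minimal illustration, assuming the "name" field is taken from the element's title and that extras are copied into the result unchanged; SketchElement is a hypothetical stand-in, not the library's class.

```python
from typing import Any, Dict, Optional


class SketchElement:
    """Hypothetical stand-in for a layout element; not Burdoc's own class."""

    def __init__(self, bbox: Dict[str, float], title: str = "LayoutElement"):
        self.bbox = bbox
        self.title = title

    def to_json(self, extras: Optional[Dict[str, Any]] = None,
                include_bbox: bool = False) -> Dict[str, Any]:
        # Start from a copy of the extras so the caller's dict is not modified.
        out: Dict[str, Any] = dict(extras) if extras else {}
        out["name"] = self.title  # assumption: "name" mirrors the title
        if include_bbox:
            out["bbox"] = self.bbox
        return out


element = SketchElement({"x0": 10.0, "y0": 20.0, "x1": 110.0, "y1": 40.0})
compact = element.to_json()
detailed = element.to_json(extras={"page": 3}, include_bbox=True)
```

Leaving the bounding box out by default keeps the output compact when only content matters.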
LayoutElementGroup
- class burdoc.elements.element.LayoutElementGroup(bbox: Bbox | None = None, items: List[LayoutElement] | None = None, title: str = 'LayoutElementGroup')
Base class for any coherent group of layout objects within the PDF. The Bbox of the LayoutElementGroup is the rectangle encompassing the Bboxes of all its members.
- append(item: LayoutElement, update_bbox: bool = True)
Add an item to the group
- Parameters:
item (LayoutElement) – Item to add
update_bbox (bool, optional) – Should the group Bbox be recalculated or ignored? Useful when items are non-contiguous (e.g. they cross columns or pages). Defaults to True.
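The effect of update_bbox can be illustrated with plain tuples. This is a rough sketch under stated assumptions: SketchGroup and the (x0, y0, x1, y1) tuple layout are hypothetical, and growing the box incrementally is one possible strategy rather than a description of Burdoc's internals.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


class SketchGroup:
    """Hypothetical stand-in for a group of elements; not Burdoc's class."""

    def __init__(self) -> None:
        self.items: List[Rect] = []
        self.bbox: Optional[Rect] = None

    def append(self, item: Rect, update_bbox: bool = True) -> None:
        self.items.append(item)
        if not update_bbox:
            return  # keep the current group box untouched
        if self.bbox is None:
            self.bbox = item
        else:
            # Grow the group box just enough to contain the new item.
            self.bbox = (min(self.bbox[0], item[0]), min(self.bbox[1], item[1]),
                         max(self.bbox[2], item[2]), max(self.bbox[3], item[3]))


group = SketchGroup()
group.append((0.0, 0.0, 100.0, 20.0))
group.append((0.0, 25.0, 120.0, 45.0))
box_before = group.bbox
# A continuation in the next column joins the group without stretching its box.
group.append((300.0, 0.0, 400.0, 20.0), update_bbox=False)
```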
- merge(leg: LayoutElementGroup) LayoutElementGroup
In-place merge with another LayoutElementGroup
- Parameters:
leg (LayoutElementGroup) – LEG to merge with
- Returns:
A reference to self
- Return type:
LayoutElementGroup
- remove(item: LayoutElement, update_bbox: bool = True)
Remove an item from the group
- Parameters:
item (LayoutElement) – Item to remove
update_bbox (bool, optional) – Should the group Bbox be recalculated or ignored? Defaults to True.
- Raises:
ValueError – Item not present in list
- to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False) Dict[str, Any]
Creates JSON object from LayoutElementGroup
Example JSON:
{ "name": "LayoutElementGroup", "bbox": {...} [optional], "items": [{...}] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the group
- Return type:
Dict[str, Any]
Aside
- class burdoc.elements.aside.Aside(bbox: Bbox | None = None, items: List[LayoutElement] | None = None)
A small delimited section of text that is separate from the surrounding flow.
Bbox
- class burdoc.elements.bbox.Bbox(x0: float, y0: float, x1: float, y1: float, page_width: float, page_height: float)
Utility class for storing and manipulating bounding boxes.
- area() float
Returns total area of bbox
- Returns:
float
- area_norm() float
Returns total area of bbox as a percentage of the page area.
- Returns:
float
- center(norm: bool = False) Point
Returns a point representing the center of the bounding box.
- Parameters:
norm (bool, optional) – Return page-normalised co-ordinates. Defaults to False.
- Returns:
Point
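Page normalisation can be pictured as dividing by the page dimensions. The sketch below assumes exactly that (x values divided by page width, y values by page height, areas by page area) and returns fractions between 0 and 1; SketchBox is a hypothetical stand-in and the real scaling may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SketchBox:
    """Hypothetical stand-in for a bounding box with its page size."""
    x0: float
    y0: float
    x1: float
    y1: float
    page_width: float
    page_height: float

    def center(self, norm: bool = False) -> Tuple[float, float]:
        cx = (self.x0 + self.x1) / 2
        cy = (self.y0 + self.y1) / 2
        if norm:
            # Assumption: normalising means dividing by the page dimensions.
            return (cx / self.page_width, cy / self.page_height)
        return (cx, cy)

    def area_norm(self) -> float:
        area = (self.x1 - self.x0) * (self.y1 - self.y0)
        return area / (self.page_width * self.page_height)


box = SketchBox(100.0, 200.0, 300.0, 400.0, page_width=500.0, page_height=800.0)
absolute_center = box.center()
relative_center = box.center(norm=True)
page_share = box.area_norm()
```

Normalised values make boxes comparable across pages of different sizes.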
- static from_points(p1: Point, p2: Point, page_width: float, page_height: float) Bbox
Create a Bbox spanning two points.
- height(norm: bool = False) float
Returns the height of the bounding box.
- Parameters:
norm (bool, optional) – Return page-normalised co-ordinates. Defaults to False.
- Returns:
float
- is_vertical() bool
Test if the bounding box is oriented vertically, i.e. its height is greater than its width.
- Returns:
True if the height is greater than the width.
- Return type:
bool
- static merge(bboxes: List[Bbox]) Bbox
Merge several Bboxes into a single bbox spanning all of them
- Parameters:
bboxes (List[Bbox]) – Bboxes to merge
- Raises:
ValueError – No bboxes passed
- Returns:
Bbox
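Merging amounts to taking the extreme co-ordinates of the inputs. A minimal sketch using plain (x0, y0, x1, y1) tuples in place of Bbox objects; merge_rects is a hypothetical helper written for illustration.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def merge_rects(rects: List[Rect]) -> Rect:
    """Return the smallest rectangle that spans every rectangle in rects."""
    if not rects:
        raise ValueError("No bboxes passed")
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


merged = merge_rects([(10.0, 10.0, 50.0, 30.0), (40.0, 5.0, 90.0, 25.0)])

try:
    merge_rects([])
    empty_rejected = False
except ValueError:
    empty_rejected = True
```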
- overlap(other_bbox: Bbox, normalisation: str = '') float
Calculates the overall overlap between this Bbox and another. Several normalisation options are provided:
“”: No normalisation
“first”: Return as percent of calling bounding box area
“second”: Return as percent of passed bounding box area
“min”: Return as percent of smallest area
“max”: Return as percent of largest area
“page”: Return as percent of page area
- Parameters:
other_bbox (Bbox) – Passed bbox
normalisation (str, optional) – Normalisation option. Defaults to “”.
- Returns:
float
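The normalisation options differ only in what the intersection area is divided by. The sketch below is an assumption-laden illustration: it works on plain tuples, takes the page area as an explicit argument, and returns fractions between 0 and 1 (whether Burdoc scales the result differently is not stated here).

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def rect_area(r: Rect) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap(first: Rect, second: Rect, normalisation: str = "",
            page_area: float = 1.0) -> float:
    # Width and height of the intersection; zero when the boxes are disjoint.
    w = max(0.0, min(first[2], second[2]) - max(first[0], second[0]))
    h = max(0.0, min(first[3], second[3]) - max(first[1], second[1]))
    divisors = {
        "": 1.0,
        "first": rect_area(first),
        "second": rect_area(second),
        "min": min(rect_area(first), rect_area(second)),
        "max": max(rect_area(first), rect_area(second)),
        "page": page_area,
    }
    if normalisation not in divisors:
        raise ValueError(f"Unknown normalisation: {normalisation!r}")
    divisor = divisors[normalisation]
    return (w * h) / divisor if divisor > 0 else 0.0


a = (0.0, 0.0, 10.0, 10.0)   # area 100
b = (5.0, 5.0, 25.0, 15.0)   # area 200, shares a 5 x 5 corner with a
raw = overlap(a, b)
of_first = overlap(a, b, "first")
of_second = overlap(a, b, "second")
```

"min" is a useful choice when testing whether a small box sits inside a larger one, since the result approaches 1 regardless of the larger box's size.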
- to_json(include_page=False) Dict[str, float]
Convert a Bbox to JSON format.
Example:
{ 'x0':float, 'y0':float, 'x1':float, 'y1':float, 'pw':float [optional], 'ph':float [optional] }
- Parameters:
include_page (bool, optional) – Include page width and height. Defaults to False.
- Returns:
Dict[str, float]
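The keys in the example above translate directly into a dictionary. A minimal sketch, assuming the page size is stored alongside the co-ordinates; bbox_to_json is a hypothetical free function rather than the library's method.

```python
from typing import Dict


def bbox_to_json(x0: float, y0: float, x1: float, y1: float,
                 page_width: float, page_height: float,
                 include_page: bool = False) -> Dict[str, float]:
    out = {"x0": x0, "y0": y0, "x1": x1, "y1": y1}
    if include_page:
        # Page size is only emitted on request, keeping the default compact.
        out["pw"] = page_width
        out["ph"] = page_height
    return out


short_form = bbox_to_json(10.0, 20.0, 110.0, 40.0, 500.0, 800.0)
long_form = bbox_to_json(10.0, 20.0, 110.0, 40.0, 500.0, 800.0, include_page=True)
```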
- to_rect() List[float]
Returns a four point representation of the box.
- Returns:
[x0, y0, x1, y1]
- Return type:
List[float]
- width(norm: bool = False) float
Returns the width of the bounding box.
- Parameters:
norm (bool, optional) – Return page-normalised co-ordinates. Defaults to False.
- Returns:
float
- x0_norm() float
x0 normalised by its position on the page
- Returns:
float
- x1_norm() float
x1 normalised by its position on the page
- Returns:
float
- x_distance(other_bbox: Bbox) float
Returns the distance between this Bbox and the passed Bbox in the x direction. Note that this is calculated centre to centre. The result is negative if the passed Bbox is to the right of this Bbox.
- Parameters:
other_bbox (Bbox) – Passed bbox
- Returns:
float
- x_overlap(other_bbox: Bbox, normalisation: str = '') float
Calculates the projected overlap between this Bbox and another in the x axis. Several normalisation options are provided:
“”: No normalisation
“first”: Return as percent of calling bounding box width
“second”: Return as percent of passed bounding box width
“min”: Return as percent of thinnest box width
“max”: Return as percent of widest box width
“page”: Return as percent of page width
- Parameters:
other_bbox (Bbox) – Passed bbox
normalisation (str, optional) – Normalisation option. Defaults to “”.
- Returns:
float
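A projected overlap ignores one axis entirely, so two boxes can overlap fully in x while not touching on the page. The sketch below uses plain (x0, y0, x1, y1) tuples and hypothetical helper functions to show the idea; the example boxes are made up.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def x_overlap(first: Rect, second: Rect) -> float:
    """Length shared by the two boxes when projected onto the x axis."""
    return max(0.0, min(first[2], second[2]) - max(first[0], second[0]))


def y_overlap(first: Rect, second: Rect) -> float:
    """Length shared by the two boxes when projected onto the y axis."""
    return max(0.0, min(first[3], second[3]) - max(first[1], second[1]))


heading = (50.0, 100.0, 300.0, 120.0)
paragraph = (50.0, 130.0, 280.0, 200.0)  # starts 10 units lower on the page

shared_width = x_overlap(heading, paragraph)
shared_height = y_overlap(heading, paragraph)
# Equivalent of the "min" option: divide by the width of the thinner box.
share_of_thinner = shared_width / min(heading[2] - heading[0],
                                      paragraph[2] - paragraph[0])
```

Here the boxes share their horizontal extent but no vertical extent, the pattern expected of elements stacked in one column.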
- y0_norm() float
y0 normalised by its position on the page
- Returns:
float
- y1_norm() float
y1 normalised by its position on the page
- Returns:
float
- y_distance(other_bbox: Bbox) float
Returns the distance between this Bbox and the passed Bbox in the y direction. Note that this is calculated centre to centre. The result is negative if the passed Bbox is below this Bbox.
- Parameters:
other_bbox (Bbox) – Passed bbox
- Returns:
float
- y_overlap(other_bbox: Bbox, normalisation: str = '') float
Calculates the projected overlap between this Bbox and another in the y axis. Several normalisation options are provided:
“”: No normalisation
“first”: Return as percent of calling bounding box height
“second”: Return as percent of passed bounding box height
“min”: Return as percent of shortest box height
“max”: Return as percent of tallest box height
“page”: Return as percent of page height
- Parameters:
other_bbox (Bbox) – Passed bbox
normalisation (str, optional) – Normalisation option. Defaults to “”.
- Returns:
float
- class burdoc.elements.bbox.Point(x: float, y: float)
DrawingElement
- class burdoc.elements.drawing.DrawingElement(bbox: Bbox, drawing_type: DrawingType = DrawingType.UNKNOWN, fill_opacity: float = 0.0, fill_colour: ndarray | None = None, stroke_opacity: float = 0.0, stroke_colour: ndarray | None = None, stroke_width: float | None = None)
Core element representing a drawing
- __init__(bbox: Bbox, drawing_type: DrawingType = DrawingType.UNKNOWN, fill_opacity: float = 0.0, fill_colour: ndarray | None = None, stroke_opacity: float = 0.0, stroke_colour: ndarray | None = None, stroke_width: float | None = None)
Creates a drawing element.
- Parameters:
bbox (Bbox) – Bbox of the extent of the drawing
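fill_opacity (float, optional) – Opacity of the drawing's fill. Defaults to 0.0.
stroke_opacity (float, optional) – Opacity of the drawing's stroke. Defaults to 0.0.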
drawing_type (DrawingType, optional) – Semantic purpose of the drawing. Default is UNKNOWN
- to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)
Convert the object into a JSON object
Example JSON:
{ "name": "LayoutElement", "bbox": {...} [optional] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the object.
- Return type:
Dict[str, Any]
- enum burdoc.elements.drawing.DrawingType(value)
Enumeration of types of drawing Burdoc understands.
LINE: Anything long and thin used as a visual separator
RECT: Usually means a square or outer edge defining an aside or section
TABLE: A collection of rectangles in a common table pattern
BULLET: A small circle indicating a textual bullet point.
UNKNOWN: An unknown drawing type
Valid values are as follows:
- LINE = <DrawingType.LINE: 1>
- RECT = <DrawingType.RECT: 2>
- BULLET = <DrawingType.BULLET: 3>
- TABLE = <DrawingType.TABLE: 4>
- UNKNOWN = <DrawingType.UNKNOWN: 5>
Font
- class burdoc.elements.font.Font(name: str, family: str, size: float, colour: int, bold: bool, italic: bool, superscript: bool, smallcaps: bool)
Representation of font information
- static from_dict(span_dict: Dict[str, Any])
Creates Font object from a PyMuPDF span
Some properties are inferred based on PyMuPDF flags, others are set dynamically from the font name
- Parameters:
span_dict (Dict[str, Any]) – The PyMuPDF span dictionary
- static split_font_name(fontname: str, type: str = '') Tuple[str, str]
Splits a font into family and base name (family-variation). Optional type argument only used when an unnamed font is found.
Consistently handles font subsetting and variations.
- Parameters:
fontname (str) – Full name of a font
type (str, optional) – Font type. Defaults to “”.
- Returns:
(font family, font basename)
- Return type:
Tuple[str, str]
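Font subsetting shows up in PDF font names as a tag of six uppercase letters followed by "+", for example "ABCDEF+Helvetica-Bold". The sketch below is one plausible way to split such a name, assuming the variation is separated from the family by a hyphen and that the fallback is used as the name when the font is unnamed; split_name is a hypothetical helper and does not reproduce Burdoc's rules.

```python
import re
from typing import Tuple

# Subset fonts embedded in a PDF carry a tag of six uppercase letters and "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def split_name(fontname: str, fallback: str = "") -> Tuple[str, str]:
    """Return (family, basename) for a raw PDF font name."""
    # Drop the subset tag; fall back to the supplied name if nothing is left.
    basename = SUBSET_TAG.sub("", fontname) or fallback
    # Assumption: the family is everything before the first hyphen.
    family = basename.split("-", 1)[0]
    return family, basename


subset = split_name("ABCDEF+Helvetica-Bold")
plain = split_name("Helvetica-Oblique")
unnamed = split_name("", fallback="Type3")
```

Stripping the tag means the same typeface embedded as different subsets still maps to one family.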
- to_json()
Convert the Font into a JSON object
- Returns:
A JSON representation of the font.
- Return type:
Dict[str, Any]
ImageElement
- class burdoc.elements.image.ImageElement(bbox: Bbox, original_bbox: Bbox, image: int, properties: Dict[str, Any], image_type: ImageType = ImageType.UNKNOWN, inline: bool = False)
Core element representing an image within a page layout
- __init__(bbox: Bbox, original_bbox: Bbox, image: int, properties: Dict[str, Any], image_type: ImageType = ImageType.UNKNOWN, inline: bool = False)
Create an image element.
- Parameters:
bbox (Bbox) – A Bbox representing the image’s visible extent
original_bbox (Bbox) – A Bbox representing the image’s true extent
image (int) – Index of page image store where image is found
properties (Dict[str, Any]) – Any additional properties of the image
image_type (ImageType, optional) – Purpose of the image. Default is UNKNOWN
inline (bool, optional) – Whether the image layout should be inline or additional. Generally set later in processing. Defaults to False.
- to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)
Convert the object into a JSON object
Example JSON:
{ "name": "LayoutElement", "bbox": {...} [optional] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the object.
- Return type:
Dict[str, Any]
- enum burdoc.elements.image.ImageType(value)
Enumeration of types of images Burdoc understands.
INVISIBLE: Image isn’t visible on page
BACKGROUND: Image is used as background for the whole page
SECTION: Image is used as a background for a page section or aside
INLINE: Image is part of the flow of text (currently unused)
DECORATIVE: Image is a decorative element in the page layout but has no semantic meaning
PRIMARY: Image is a ‘hero’ image on the page
GRADIENT: Image is a smooth gradient used as a background
LINE: Image is used to semantically separate page sections
Valid values are as follows:
- INVISIBLE = <ImageType.INVISIBLE: 1>
- BACKGROUND = <ImageType.BACKGROUND: 2>
- SECTION = <ImageType.SECTION: 3>
- INLINE = <ImageType.INLINE: 4>
- DECORATIVE = <ImageType.DECORATIVE: 5>
- PRIMARY = <ImageType.PRIMARY: 6>
- GRADIENT = <ImageType.GRADIENT: 7>
- LINE = <ImageType.LINE: 8>
- UNKNOWN = <ImageType.UNKNOWN: 9>
LineElement
- class burdoc.elements.line.LineElement(bbox: Bbox, spans: List[Span], rotation: Tuple[float, float])
Core element representing a line of text
- static from_dict(line_dict: Dict[str, Any], page_width: float, page_height: float) LineElement
Create a LineElement from a PyMuPDF line dictionary
- Parameters:
line_dict (Dict[str, Any]) – The PyMuPDF dictionary
page_width (float) – Used to normalise bbox
page_height (float) – Used to normalise bbox
- Returns:
LineElement
- get_text() str
Returns all text contained within the line as a string. This strips out any format or font information.
- Returns:
str
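Conceptually, extracting a line's text means concatenating the text of its spans and discarding the styling. The sketch below assumes spans are joined without a separator, on the basis that extracted spans normally carry their own whitespace; SketchSpan and line_text are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchSpan:
    """Hypothetical stand-in for a run of text with uniform styling."""
    text: str
    bold: bool = False


def line_text(spans: List[SketchSpan]) -> str:
    # Styling flags are ignored; only the characters are kept.
    return "".join(span.text for span in spans)


spans = [SketchSpan("Hello "), SketchSpan("world", bold=True), SketchSpan("!")]
text = line_text(spans)
```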
- to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)
Convert the object into a JSON object
Example JSON:
{ "name": "LayoutElement", "bbox": {...} [optional] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the object.
- Return type:
Dict[str, Any]
PageSection
- class burdoc.elements.section.PageSection(bbox: Bbox | None = None, items: List[LayoutElement] | None = None, default: bool = False, backing_drawing: DrawingElement | None = None, backing_image: ImageElement | None = None, inline: bool = False)
A fully contained section of the page on which layout analysis should be done independently.
- __init__(bbox: Bbox | None = None, items: List[LayoutElement] | None = None, default: bool = False, backing_drawing: DrawingElement | None = None, backing_image: ImageElement | None = None, inline: bool = False)
Create a PageSection. One of bbox or items must be provided.
- Parameters:
bbox (Optional[Bbox], optional) – Bbox of the section. Defaults to None.
items (Optional[List[LayoutElement]], optional) – Items contained within the section. Defaults to None.
default (bool, optional) – Is this part of the underlying page or a subsection. Defaults to False.
backing_drawing (Optional[Any], optional) – Drawing used as the background for this section only. Defaults to None.
backing_image (Optional[Any], optional) – Image used as the background for this section only. Defaults to None.
inline (bool, optional) – Is this section inline with surrounding text. Usually inferred later in the pipeline. Defaults to False.
- to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)
Creates JSON object from LayoutElementGroup
Example JSON:
{ "name": "LayoutElementGroup", "bbox": {...} [optional], "items": [{...}] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the group
- Return type:
Dict[str, Any]
Span
- class burdoc.elements.span.Span(bbox: Bbox, text: str, font: Font)
Representation of a continuous run of text with the same font information.
- static from_dict(span_dict: Dict[str, Any], page_width: float, page_height: float)
Creates a Span from a PyMuPDF span dictionary
- Parameters:
span_dict (Dict[str, Any]) – The PyMuPDF span dictionary
page_width (float) – Used to normalise bbox
page_height (float) – Used to normalise bbox
- Returns:
Span
- to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False, **kwargs)
Convert the object into a JSON object
Example JSON:
{ "name": "LayoutElement", "bbox": {...} [optional] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the object.
- Return type:
Dict[str, Any]
Table
- class burdoc.elements.table.Table(bbox: Bbox, row_boxes: List[Tuple[TableParts, Bbox]], col_boxes: List[Tuple[TableParts, Bbox]], merge_boxes: List[Tuple[TableParts, Bbox]])
Representation of a table within the text.
- __init__(bbox: Bbox, row_boxes: List[Tuple[TableParts, Bbox]], col_boxes: List[Tuple[TableParts, Bbox]], merge_boxes: List[Tuple[TableParts, Bbox]])
Creates a Table element
- Parameters:
bbox (Bbox) – Bounding box of the table
row_boxes (List[Tuple[TableParts, Bbox]]) – Bounding box and descriptor of each row - use TableParts.COLUMNHEADER to indicate a row used as a header
col_boxes (List[Tuple[TableParts, Bbox]]) – Bounding box and descriptor of each column - use TableParts.ROWHEADER to indicate a column used as a header
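One way to read row_boxes and col_boxes is as a grid: each cell occupies the horizontal extent of its column and the vertical extent of its row. The sketch below illustrates that data shape with plain tuples and a small enum that mirrors some of the TableParts names; it is an assumption for illustration, not a description of how Burdoc builds cells.

```python
from enum import Enum
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


class Part(Enum):
    """Hypothetical enum mirroring some of the TableParts names."""
    COLUMN = 1
    ROW = 2
    COLUMNHEADER = 3
    ROWHEADER = 4


def cell_rects(row_boxes: List[Tuple[Part, Rect]],
               col_boxes: List[Tuple[Part, Rect]]) -> List[List[Rect]]:
    # Each cell takes x0/x1 from its column and y0/y1 from its row.
    return [[(col[0], row[1], col[2], row[3]) for _, col in col_boxes]
            for _, row in row_boxes]


rows = [(Part.COLUMNHEADER, (0.0, 0.0, 200.0, 20.0)),
        (Part.ROW, (0.0, 20.0, 200.0, 40.0))]
cols = [(Part.ROWHEADER, (0.0, 0.0, 50.0, 40.0)),
        (Part.COLUMN, (50.0, 0.0, 200.0, 40.0))]

cells = cell_rects(rows, cols)
header_rows = [i for i, (part, _) in enumerate(rows) if part is Part.COLUMNHEADER]
```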
- to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)
Convert the object into a JSON object
Example JSON:
{ "name": "LayoutElement", "bbox": {...} [optional] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the object.
- Return type:
Dict[str, Any]
- enum burdoc.elements.table.TableParts(value)
Enum defining the different parts of a table that can be extracted
Valid values are as follows:
- TABLE = <TableParts.TABLE: 0>
- COLUMN = <TableParts.COLUMN: 1>
- ROW = <TableParts.ROW: 2>
- COLUMNHEADER = <TableParts.COLUMNHEADER: 3>
- ROWHEADER = <TableParts.ROWHEADER: 4>
- SPANNINGCELL = <TableParts.SPANNINGCELL: 5>
TextBlock
- class burdoc.elements.textblock.TextBlock(bbox: Bbox | None = None, items: List[LineElement] | None = None, text_type: TextBlockType = TextBlockType.PARAGRAPH)
Represents a standard grouping of lines into a paragraph. All text within a textblock can be considered to be of semantically equivalent fonts. This may include variations in bold or italics.
- get_text() str
Returns all text contained within the block as a string. This strips out any format or font information.
- Returns:
str
- to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False, **kwargs)
Convert the textblock into a JSON object
- Parameters:
extras (Dict[str, Any], optional) – Any additional fields that should be included. Defaults to None
include_bbox (bool, optional) – Defaults to False.
**kwargs – Arbitrary keyword arguments to be passed to the superclass
- Returns:
Dict[str, Any]
- enum burdoc.elements.textblock.TextBlockType(value)
Possible types of text supported by the semantic classifier.
Valid values are as follows:
- SMALL = <TextBlockType.SMALL: 1>
- PARAGRAPH = <TextBlockType.PARAGRAPH: 2>
- H1 = <TextBlockType.H1: 3>
- H2 = <TextBlockType.H2: 4>
- H3 = <TextBlockType.H3: 5>
- H4 = <TextBlockType.H4: 6>
- H5 = <TextBlockType.H5: 7>
- H6 = <TextBlockType.H6: 8>
- EMPHASIS = <TextBlockType.EMPHASIS: 9>
TextList
- class burdoc.elements.textlist.TextList(ordered: bool, bbox: Bbox | None = None, items: List[LayoutElement] | None = None)
An ordered or unordered list
- __init__(ordered: bool, bbox: Bbox | None = None, items: List[LayoutElement] | None = None)
Create a text list. Must provide one of bbox or items. If items are provided the bbox will be inferred.
- Parameters:
ordered (bool) – Is the list ordered (alphanumeric) or unordered (bullets)
bbox (Optional[Bbox], optional) – Bbox containing the list. Defaults to None.
items (Optional[List[LayoutElement]], optional) – Items making up the list. Defaults to None.
- to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)
Creates JSON object from LayoutElementGroup
Example JSON:
{ "name": "LayoutElementGroup", "bbox": {...} [optional], "items": [{...}] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the group
- Return type:
Dict[str, Any]
- class burdoc.elements.textlist.TextListItem(label: str, items: List[TextBlock])
A single item within a list. Equivalent to <li>
- __init__(label: str, items: List[TextBlock])
Create a text list item
- Parameters:
label (str) – The label of the list item. Can be bullet or alphanumeric
items (List[TextBlock]) – The content of the list item
- to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)
Creates JSON object from LayoutElementGroup
Example JSON:
{ "name": "LayoutElementGroup", "bbox": {...} [optional], "items": [{...}] }
- Parameters:
extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
- Returns:
A JSON representation of the group
- Return type:
Dict[str, Any]