Elements

Elements are (mostly) objects that have a physical location within a PDF page.

Elements generally inherit from LayoutElement to provide a consistent interface for accessing location information.

LayoutElement 

class burdoc.elements.element.LayoutElement(bbox: Bbox, title: str = 'LayoutElement')

Base class for any layout object within the PDF. LayoutElements can be used to describe anything that has a bbox.

to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False) → Dict[str, Any]

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

LayoutElementGroup 

class burdoc.elements.element.LayoutElementGroup(bbox: Bbox | None = None, items: List[LayoutElement] | None = None, title: str = 'LayoutElementGroup')

Base class for any coherent group of layout objects within the PDF. The Bbox of the LayoutElementGroup is the rectangle encompassing the Bboxes of all its members.

append(item: LayoutElement, update_bbox: bool = True)

Add an item to the group

Parameters:

item (LayoutElement) – Item to add
update_bbox (bool, optional) – Should the group Bbox be recalculated or ignored? Ignoring it is useful when items are non-contiguous (e.g. they cross columns or pages). Defaults to True.

merge(leg: LayoutElementGroup) → LayoutElementGroup

In-place merge with another LayoutElementGroup

Parameters:: leg (LayoutElementGroup) – LEG to merge with
Returns:: A reference to self
Return type:: LayoutElementGroup

remove(item: LayoutElement, update_bbox: bool = True)

Remove an item from the group

Parameters:

item (LayoutElement) – Item to remove
update_bbox (bool, optional) – Should the group Bbox be recalculated or ignored? Defaults to True.

Raises:

ValueError – Item not present in list
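
The effect of update_bbox on append and remove can be sketched with plain tuples. SketchGroup and enclosing are hypothetical names, rectangles are (x0, y0, x1, y1) tuples rather than Bbox objects, and the recalculation rule (smallest rectangle covering all members) is taken from the class description; the real class may differ in detail.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def enclosing(rects: List[Rect]) -> Optional[Rect]:
    """Smallest rectangle covering every rect, or None for an empty list."""
    if not rects:
        return None
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


class SketchGroup:
    """Hypothetical stand-in for LayoutElementGroup holding plain rects."""

    def __init__(self) -> None:
        self.items: List[Rect] = []
        self.bbox: Optional[Rect] = None

    def append(self, item: Rect, update_bbox: bool = True) -> None:
        self.items.append(item)
        if update_bbox:
            self.bbox = enclosing(self.items)

    def remove(self, item: Rect, update_bbox: bool = True) -> None:
        # list.remove raises ValueError when the item is absent.
        self.items.remove(item)
        if update_bbox:
            self.bbox = enclosing(self.items)


group = SketchGroup()
group.append((0.0, 0.0, 50.0, 10.0))
group.append((0.0, 12.0, 80.0, 22.0))
grown = group.bbox
# An item from another column is added without stretching the group box.
group.append((300.0, 0.0, 350.0, 10.0), update_bbox=False)
unchanged = group.bbox
```

Passing update_bbox=False keeps the group box tight around the first column even though the group now holds an item from elsewhere on the page.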

to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False) → Dict[str, Any]

Creates JSON object from LayoutElementGroup

Example JSON:

{
    "name": "LayoutElementGroup",
    "bbox": {...} [optional]
    "items": [{...}]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the group

Return type:

Dict[str, Any]

Aside 

class burdoc.elements.aside.Aside(bbox: Bbox | None = None, items: List[LayoutElement] | None = None)

A small delimited section of text that is separate from the surrounding flow.

Bbox 

class burdoc.elements.bbox.Bbox(x0: float, y0: float, x1: float, y1: float, page_width: float, page_height: float)

Utility class for storing and manipulating bounding boxes.

area() → float

Returns total area of bbox

Returns:: float

area_norm() → float

Returns total area of bbox as a percentage of the page area.

Returns:: float

center(norm: bool = False) → Point

Returns a point representing the center of the bounding box.

Parameters:

norm (bool, optional) – Return page-normalised co-ordinates. Defaults to False.

Returns:

Point
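
As a rough illustration of what page-normalised co-ordinates mean here, the sketch below assumes that normalisation divides x by the page width and y by the page height. The center function and the local Point type are hypothetical stand-ins, not the Burdoc classes.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """Local stand-in for burdoc.elements.bbox.Point."""
    x: float
    y: float


def center(x0: float, y0: float, x1: float, y1: float,
           page_width: float, page_height: float,
           norm: bool = False) -> Point:
    # Midpoint of the box in page units.
    cx = (x0 + x1) / 2
    cy = (y0 + y1) / 2
    if norm:
        # Assumption: normalising means dividing by the page dimensions.
        return Point(cx / page_width, cy / page_height)
    return Point(cx, cy)


absolute = center(100.0, 200.0, 300.0, 400.0, 600.0, 800.0)
normalised = center(100.0, 200.0, 300.0, 400.0, 600.0, 800.0, norm=True)
```

Under this assumption the normalised centre always lies between 0 and 1 on each axis for a box that sits inside the page.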

clone() → Bbox

Returns a clone of the bounding box

Returns:: Bbox

static from_points(p1: Point, p2: Point, page_width: float, page_height: float) → Bbox

Create a Bbox spanning two points.

Parameters:

p1 (Point) – Top-left corner
p2 (Point) – Bottom-right corner
page_width (float) – Width of page
page_height (float) – Height of page

Returns:

Bbox

height(norm: bool = False) → float

Returns the height of the bounding box.

Parameters:

norm (bool, optional) – Return page-normalised co-ordinates. Defaults to False.

Returns:

float

is_vertical() → bool

Tests whether the bounding box is oriented vertically, i.e. whether its height is greater than its width.

Returns:: True if the height is greater than the width
Return type:: bool
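
The orientation test reduces to comparing the two side lengths. A minimal sketch on raw co-ordinates (is_vertical here is a free function written for illustration, not the Bbox method):

```python
def is_vertical(x0: float, y0: float, x1: float, y1: float) -> bool:
    # Vertical means strictly taller than wide.
    return (y1 - y0) > (x1 - x0)


sidebar = is_vertical(20.0, 50.0, 60.0, 700.0)   # tall, narrow box
banner = is_vertical(20.0, 50.0, 580.0, 90.0)    # wide, short box
square = is_vertical(0.0, 0.0, 10.0, 10.0)       # equal sides
```

Because the comparison is strict, a square box is not treated as vertical in this sketch.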

static merge(bboxes: List[Bbox]) → Bbox

Merge several Bboxes into a single bbox spanning all of them

Parameters:: bboxes (List[Bbox]) – Bboxes to merge
Raises:: ValueError – No bboxes passed
Returns:: Bbox

overlap(other_bbox: Bbox, normalisation: str = '') → float

Calculates the overall overlap between this Bbox and another. Several normalisation options are provided:

“”: No normalisation
“first”: Return as percent of calling bounding box area
“second”: Return as percent of passed bounding box area
“min”: Return as percent of smallest area
“max”: Return as percent of largest area
“page”: Return as percent of page area

Parameters:

other_bbox (Bbox) – Passed bbox
normalisation (str, optional) – Normalisation option. Defaults to “”.

Returns:

float
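
The normalisation options can be sketched as a choice of divisor applied to the raw intersection area. This is an illustrative stand-in, not Burdoc's code: rectangles are plain (x0, y0, x1, y1) tuples, page_area is a hypothetical extra argument, and the result is returned as a fraction between 0 and 1, which is an assumption about what "percent" means here.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def rect_area(r: Rect) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap(first: Rect, second: Rect, normalisation: str = "",
            page_area: float = 1.0) -> float:
    # Intersection extent on each axis; clamps to zero for disjoint boxes.
    w = max(0.0, min(first[2], second[2]) - max(first[0], second[0]))
    h = max(0.0, min(first[3], second[3]) - max(first[1], second[1]))
    raw = w * h
    divisors = {
        "": 1.0,
        "first": rect_area(first),
        "second": rect_area(second),
        "min": min(rect_area(first), rect_area(second)),
        "max": max(rect_area(first), rect_area(second)),
        "page": page_area,
    }
    if normalisation not in divisors:
        raise ValueError(f"Unknown normalisation: {normalisation!r}")
    divisor = divisors[normalisation]
    # Guard against degenerate (zero-area) boxes.
    return raw / divisor if divisor > 0 else 0.0


a = (0.0, 0.0, 10.0, 10.0)    # area 100
b = (5.0, 5.0, 25.0, 15.0)    # area 200
raw = overlap(a, b)
of_first = overlap(a, b, "first")
of_second = overlap(a, b, "second")
```

Choosing "min" answers "how much of the smaller box is covered", which is often the useful question when testing whether one element sits inside another.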

to_json(include_page=False) → Dict[str, float]

Convert a Bbox to JSON format.

Example:

{
    'x0':float,
    'y0':float,
    'x1':float,
    'y1':float,
    'pw':float [optional],
    'ph':float [optional]
}

Parameters:: include_page (bool, optional) – Include page width and height. Defaults to False.
Returns:: Dict[str, float]

to_rect() → List[float]

Returns a four-coordinate representation of the box.

Returns:: [x0, y0, x1, y1]
Return type:: List[float]

width(norm: bool = False) → float

Returns the width of the bounding box.

Parameters:

norm (bool, optional) – Return page-normalised co-ordinates. Defaults to False.

Returns:

float

x0_norm() → float

x0 normalised by its position on the page

Returns:: float

x1_norm() → float

x1 normalised by its position on the page

Returns:: float

x_distance(other_bbox: Bbox) → float

Returns the distance between this Bbox and the passed Bbox in the x direction. Note that this is calculated centre to centre. The result is negative if the passed Bbox is to the right of this Bbox.

Parameters:: other_bbox (Bbox) – Passed bbox
Returns:: float

x_overlap(other_bbox: Bbox, normalisation: str = '') → float

Calculates the projected overlap between this Bbox and another in the x axis. Several normalisation options are provided:

“”: No normalisation
“first”: Return as percent of calling bounding box width
“second”: Return as percent of passed bounding box width
“min”: Return as percent of thinnest box width
“max”: Return as percent of widest box width
“page”: Return as percent of page width

Parameters:

other_bbox (Bbox) – Passed bbox
normalisation (str, optional) – Normalisation option. Defaults to “”.

Returns:

float

y0_norm() → float

y0 normalised by its position on the page

Returns:: float

y1_norm() → float

y1 normalised by its position on the page

Returns:: float

y_distance(other_bbox: Bbox) → float

Returns the distance between this Bbox and the passed Bbox in the y direction. Note that this is calculated centre to centre. The result is negative if the passed Bbox is below this Bbox.

Parameters:: other_bbox (Bbox) – Passed bbox
Returns:: float

y_overlap(other_bbox: Bbox, normalisation: str = '') → float

Calculates the projected overlap between this Bbox and another in the y axis. Several normalisation options are provided:

“”: No normalisation
“first”: Return as percent of calling bounding box height
“second”: Return as percent of passed bounding box height
“min”: Return as percent of shortest box height
“max”: Return as percent of tallest box height
“page”: Return as percent of page height

Parameters:

other_bbox (Bbox) – Passed bbox
normalisation (str, optional) – Normalisation option. Defaults to “”.

Returns:

float

class burdoc.elements.bbox.Point(x: float, y: float)

DrawingElement 

class burdoc.elements.drawing.DrawingElement(bbox: Bbox, drawing_type: DrawingType = DrawingType.UNKNOWN, fill_opacity: float = 0.0, fill_colour: ndarray | None = None, stroke_opacity: float = 0.0, stroke_colour: ndarray | None = None, stroke_width: float | None = None)

Core element representing a drawing

__init__(bbox: Bbox, drawing_type: DrawingType = DrawingType.UNKNOWN, fill_opacity: float = 0.0, fill_colour: ndarray | None = None, stroke_opacity: float = 0.0, stroke_colour: ndarray | None = None, stroke_width: float | None = None)

Creates a drawing element.

Parameters:

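bbox (Bbox) – Bbox of the extent of the drawing
drawing_type (DrawingType, optional) – Semantic purpose of the drawing. Defaults to DrawingType.UNKNOWN.
fill_opacity (float, optional) – Opacity of the fill. Defaults to 0.0.
fill_colour (Optional[ndarray], optional) – Colour of the fill. Defaults to None.
stroke_opacity (float, optional) – Opacity of the stroke. Defaults to 0.0.
stroke_colour (Optional[ndarray], optional) – Colour of the stroke. Defaults to None.
stroke_width (Optional[float], optional) – Width of the stroke. Defaults to None.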

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

enum burdoc.elements.drawing.DrawingType(value)

Enumeration of types of drawing Burdoc understands.

LINE: Anything long and thin used as a visual separator
RECT: Usually means a square or outer edge defining an aside or section
TABLE: A collection of rectangles in a common table pattern
BULLET: A small circle indicating a textual bullet point.
UNKNOWN: An unknown drawing type

Valid values are as follows:

LINE = <DrawingType.LINE: 1>

RECT = <DrawingType.RECT: 2>

BULLET = <DrawingType.BULLET: 3>

TABLE = <DrawingType.TABLE: 4>

UNKNOWN = <DrawingType.UNKNOWN: 5>

Font 

class burdoc.elements.font.Font(name: str, family: str, size: float, colour: int, bold: bool, italic: bool, superscript: bool, smallcaps: bool)

Representation of font information

static from_dict(span_dict: Dict[str, Any])

Creates Font object from a PyMuPDF span

Some properties are inferred based on PyMuPDF flags, others are set dynamically from the font name

Parameters:: span_dict (Dict[str, Any]) – The PyMuPDF span dictionary

static split_font_name(fontname: str, type: str = '') → Tuple[str, str]

Splits a font name into family and base name (family-variation). The optional type argument is only used when an unnamed font is found.

Consistently handles font subsetting and variations.

Parameters:

fontname (str) – Full name of a font
type (str, optional) – Font type. Defaults to “”.

Returns:

(font family, font basename)

Return type:

Tuple[str, str]
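
One way to picture the split is sketched below. It is not Burdoc's implementation: split_font_name_sketch is a hypothetical helper, it ignores the type argument, and it assumes that the variation is separated from the family by a hyphen or comma. The one fixed point is the PDF convention that subsetted fonts carry a tag of six uppercase letters followed by a plus sign.

```python
import re
from typing import Tuple

# PDF subset fonts are prefixed with six uppercase letters and "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def split_font_name_sketch(fontname: str) -> Tuple[str, str]:
    # Drop the subset tag so that "ABCDEF+Foo" and "Foo" compare equal.
    basename = SUBSET_TAG.sub("", fontname)
    # Assumption: the family is everything before the first "-" or ",".
    family = re.split(r"[-,]", basename, maxsplit=1)[0]
    return family, basename


family, basename = split_font_name_sketch("ABCDEF+OpenSans-BoldItalic")
plain_family, plain_basename = split_font_name_sketch("Helvetica")
```

The return order follows the documented (font family, font basename) tuple.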

to_json()

Convert the Font into a JSON object

Returns:: A JSON representation of the font.
Return type:: Dict[str, Any]

ImageElement 

class burdoc.elements.image.ImageElement(bbox: Bbox, original_bbox: Bbox, image: int, properties: Dict[str, Any], image_type: ImageType = ImageType.UNKNOWN, inline: bool = False)

Core element representing an image within a page layout

__init__(bbox: Bbox, original_bbox: Bbox, image: int, properties: Dict[str, Any], image_type: ImageType = ImageType.UNKNOWN, inline: bool = False)

Create an image element.

Parameters:

bbox (Bbox) – A Bbox representing the image’s visible extent
original_bbox (Bbox) – A Bbox representing the image’s true extent
image (int) – Index into the page image store where the image is found
properties (Dict[str, Any]) – Any additional properties of the image
image_type (ImageType, optional) – Purpose of the image. Default is UNKNOWN
inline (bool, optional) – Whether the image layout should be inline or additional. Generally set later in processing. Defaults to False.

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

enum burdoc.elements.image.ImageType(value)

Enumeration of types of images Burdoc understands.

INVISIBLE: Image isn’t visible on page
BACKGROUND: Image is used as background for the whole page
SECTION: Image is used as a background for a page section or aside
INLINE: Image is part of the flow of text (currently unused)
DECORATIVE: Image is a decorative element in the page layout but has no semantic meaning
PRIMARY: Image is a ‘hero’ image on the page
GRADIENT: Image is a smooth gradient used as a background
LINE: Image is used to semantically separate page sections

Valid values are as follows:

INVISIBLE = <ImageType.INVISIBLE: 1>

BACKGROUND = <ImageType.BACKGROUND: 2>

SECTION = <ImageType.SECTION: 3>

INLINE = <ImageType.INLINE: 4>

DECORATIVE = <ImageType.DECORATIVE: 5>

PRIMARY = <ImageType.PRIMARY: 6>

GRADIENT = <ImageType.GRADIENT: 7>

LINE = <ImageType.LINE: 8>

UNKNOWN = <ImageType.UNKNOWN: 9>

LineElement 

class burdoc.elements.line.LineElement(bbox: Bbox, spans: List[Span], rotation: Tuple[float, float])

Core element representing a line of text

__init__(bbox: Bbox, spans: List[Span], rotation: Tuple[float, float])

Creates a line element

Parameters:

bbox (Bbox) – Bbox of the extent of the line
spans (List[Span]) – List of text spans within the line. Separation into spans implies a change in font.
rotation (Tuple[float, float]) – Degree of rotation from the x-axis

static from_dict(line_dict: Dict[str, Any], page_width: float, page_height: float) → LineElement

Create a LineElement from a PyMuPDF line dictionary

Parameters:

line_dict (Dict[str, Any]) – The PyMuPDF dictionary
page_width (float) – Used to normalise bbox
page_height (float) – Used to normalise bbox

Returns:

LineElement

get_text() → str

Returns all text contained within the line as a string. This strips out any format or font information.

Returns:: str
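
Conceptually, get_text flattens the spans of a line into one string and discards the font data. The sketch below assumes the span texts are joined without a separator; SketchSpan and line_text are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchSpan:
    """Hypothetical stand-in for Span: a text run plus its font name."""
    text: str
    font_name: str


def line_text(spans: List[SketchSpan]) -> str:
    # Assumption: span texts are concatenated as-is; font data is dropped.
    return "".join(span.text for span in spans)


spans = [
    SketchSpan("Burdoc ", "Helvetica"),
    SketchSpan("parses", "Helvetica-Bold"),
    SketchSpan(" PDFs.", "Helvetica"),
]
text = line_text(spans)
```

The bold run in the middle contributes its characters but no trace of its formatting.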

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

PageSection 

class burdoc.elements.section.PageSection(bbox: Bbox | None = None, items: List[LayoutElement] | None = None, default: bool = False, backing_drawing: DrawingElement | None = None, backing_image: ImageElement | None = None, inline: bool = False)

A fully contained section of the page on which layout analysis should be done independently.

__init__(bbox: Bbox | None = None, items: List[LayoutElement] | None = None, default: bool = False, backing_drawing: DrawingElement | None = None, backing_image: ImageElement | None = None, inline: bool = False)

Create a PageSection. One of bbox or items must be provided

Parameters:

bbox (Optional[Bbox], optional) – BBox of the section. Defaults to None.
items (Optional[List[LayoutElement]], optional) – Items contained within the section. Defaults to None.
default (bool, optional) – Is this part of the underlying page or a subsection. Defaults to False.
backing_drawing (Optional[DrawingElement], optional) – Drawing used as the background for this section only. Defaults to None.
backing_image (Optional[ImageElement], optional) – Image used as the background for this section only. Defaults to None.
inline (bool, optional) – Is this section inline with surrounding text. Usually inferred later in the pipeline. Defaults to False.

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Creates JSON object from LayoutElementGroup

Example JSON:

{
    "name": "LayoutElementGroup",
    "bbox": {...} [optional]
    "items": [{...}]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the group

Return type:

Dict[str, Any]

Span 

class burdoc.elements.span.Span(bbox: Bbox, text: str, font: Font)

Representation of a continuous run of text with the same font information.

static from_dict(span_dict: Dict[str, Any], page_width: float, page_height: float)

Creates a Span from a PyMuPDF span dictionary

Parameters:

span_dict (Dict[str, Any]) – The PyMuPDF span dictionary
page_width (float) – Used to normalise bbox
page_height (float) – Used to normalise bbox

Returns:

Span

to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

Table 

class burdoc.elements.table.Table(bbox: Bbox, row_boxes: List[Tuple[TableParts, Bbox]], col_boxes: List[Tuple[TableParts, Bbox]], merge_boxes: List[Tuple[TableParts, Bbox]])

Representation of a table within the text.

__init__(bbox: Bbox, row_boxes: List[Tuple[TableParts, Bbox]], col_boxes: List[Tuple[TableParts, Bbox]], merge_boxes: List[Tuple[TableParts, Bbox]])

Creates a Table element

Parameters:

bbox (Bbox) – Bounding box of the table
row_boxes (List[Tuple[TableParts, Bbox]]) – Bounding box and descriptor of each row - use TableParts.COLUMNHEADER to indicate a row used as a header
col_boxes (List[Tuple[TableParts, Bbox]]) – Bounding box and descriptor of each column - use TableParts.ROWHEADER to indicate a column used as a header

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

enum burdoc.elements.table.TableParts(value)

Enum defining the different parts of a table that can be extracted

Valid values are as follows:

TABLE = <TableParts.TABLE: 0>

COLUMN = <TableParts.COLUMN: 1>

ROW = <TableParts.ROW: 2>

COLUMNHEADER = <TableParts.COLUMNHEADER: 3>

ROWHEADER = <TableParts.ROWHEADER: 4>

SPANNINGCELL = <TableParts.SPANNINGCELL: 5>
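
The row_boxes and col_boxes arguments of Table pair each bounding box with a TableParts descriptor. The sketch below shows the expected shape of those lists under simplifying assumptions: the enum is re-declared locally with the values listed above, and plain (x0, y0, x1, y1) tuples stand in for Bbox objects.

```python
from enum import Enum
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # stand-in for Bbox


class TableParts(Enum):
    """Local copy of the documented enum values, for illustration only."""
    TABLE = 0
    COLUMN = 1
    ROW = 2
    COLUMNHEADER = 3
    ROWHEADER = 4
    SPANNINGCELL = 5


# A row used as a header is tagged COLUMNHEADER; the others are plain rows.
row_boxes: List[Tuple[TableParts, Rect]] = [
    (TableParts.COLUMNHEADER, (50.0, 100.0, 500.0, 120.0)),
    (TableParts.ROW, (50.0, 120.0, 500.0, 140.0)),
    (TableParts.ROW, (50.0, 140.0, 500.0, 160.0)),
]
# A column used as a header is tagged ROWHEADER.
col_boxes: List[Tuple[TableParts, Rect]] = [
    (TableParts.ROWHEADER, (50.0, 100.0, 200.0, 160.0)),
    (TableParts.COLUMN, (200.0, 100.0, 500.0, 160.0)),
]

header_rows = [i for i, (part, _) in enumerate(row_boxes)
               if part is TableParts.COLUMNHEADER]
header_cols = [i for i, (part, _) in enumerate(col_boxes)
               if part is TableParts.ROWHEADER]
```

Note the naming: a header row is labelled COLUMNHEADER because it holds the headings of the columns, and a header column is labelled ROWHEADER for the same reason.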

TextBlock 

class burdoc.elements.textblock.TextBlock(bbox: Bbox | None = None, items: List[LineElement] | None = None, text_type: TextBlockType = TextBlockType.PARAGRAPH)

Represents a standard grouping of lines into a paragraph. All text within a textblock can be considered to be of semantically equivalent fonts. This may include variations in bold or italics.

get_text() → str

Returns all text contained within the block as a string. This strips out any format or font information.

Returns:: str

to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False, **kwargs)

Convert the textblock into a JSON object

Parameters:

extras (Dict[str, Any], optional) – Any additional fields that should be included. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.
**kwargs – Arbitrary keyword arguments to be passed to the superclass

Returns:

Dict[str, Any]

enum burdoc.elements.textblock.TextBlockType(value)

Possible types of text supported by the semantic classifier.

Valid values are as follows:

SMALL = <TextBlockType.SMALL: 1>

PARAGRAPH = <TextBlockType.PARAGRAPH: 2>

H1 = <TextBlockType.H1: 3>

H2 = <TextBlockType.H2: 4>

H3 = <TextBlockType.H3: 5>

H4 = <TextBlockType.H4: 6>

H5 = <TextBlockType.H5: 7>

H6 = <TextBlockType.H6: 8>

EMPHASIS = <TextBlockType.EMPHASIS: 9>

TextList 

class burdoc.elements.textlist.TextList(ordered: bool, bbox: Bbox | None = None, items: List[LayoutElement] | None = None)

An ordered or unordered list

__init__(ordered: bool, bbox: Bbox | None = None, items: List[LayoutElement] | None = None)

Create a text list. Must provide one of bbox or items. If items are provided the bbox will be inferred.

Parameters:

ordered (bool) – Is the list ordered (alphanumeric) or unordered (bullets)
bbox (Optional[Bbox], optional) – Bbox containing the list. Defaults to None.
items (Optional[List[LayoutElement]], optional) – Items making up the list. Defaults to None.

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Creates JSON object from LayoutElementGroup

Example JSON:

{
    "name": "LayoutElementGroup",
    "bbox": {...} [optional]
    "items": [{...}]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the group

Return type:

Dict[str, Any]

class burdoc.elements.textlist.TextListItem(label: str, items: List[TextBlock])

A single item within a list. Equivalent to <li>

__init__(label: str, items: List[TextBlock])

Create a text list item

Parameters:

label (str) – The label of the list item. Can be bullet or alphanumeric
items (List[TextBlock]) – The content of the list item

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Creates JSON object from LayoutElementGroup

Example JSON:

{
    "name": "LayoutElementGroup",
    "bbox": {...} [optional]
    "items": [{...}]
}

Parameters:

extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.
include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the group

Return type:

Dict[str, Any]