Elements

Elements are (mostly) objects that have a physical location within a PDF page.

Elements generally inherit from LayoutElement to provide a consistent interface for accessing location information.

LayoutElement

class burdoc.elements.element.LayoutElement(bbox: Bbox, title: str = 'LayoutElement')

Base class for any layout object within the PDF. LayoutElements can be used to describe anything that has a bbox.

to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False) Dict[str, Any]

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]
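As a rough illustration of the shape described above, the sketch below uses hypothetical stand-in classes (SketchBbox and SketchElement, not Burdoc's own) to show how extras and include_bbox might combine into the returned dictionary. Merging extras as top-level keys is an assumption; the real method may arrange them differently.

```python
from typing import Any, Dict, Optional


class SketchBbox:
    """Hypothetical stand-in for burdoc's Bbox (illustration only)."""

    def __init__(self, x0: float, y0: float, x1: float, y1: float) -> None:
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1

    def to_json(self) -> Dict[str, float]:
        return {"x0": self.x0, "y0": self.y0, "x1": self.x1, "y1": self.y1}


class SketchElement:
    """Hypothetical stand-in for LayoutElement (illustration only)."""

    def __init__(self, bbox: SketchBbox, title: str = "LayoutElement") -> None:
        self.bbox = bbox
        self.title = title

    def to_json(self, extras: Optional[Dict[str, Any]] = None,
                include_bbox: bool = False) -> Dict[str, Any]:
        # Start from the element name, then layer the optional parts on top.
        out: Dict[str, Any] = {"name": self.title}
        if include_bbox:
            out["bbox"] = self.bbox.to_json()
        if extras:
            # Assumption: extras are merged as top-level keys.
            out.update(extras)
        return out


element = SketchElement(SketchBbox(10.0, 20.0, 110.0, 60.0))
print(element.to_json())
print(element.to_json(extras={"page": 3}, include_bbox=True))
```

With the defaults only the name is emitted; the bounding box and any extras appear only when requested.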

LayoutElementGroup

class burdoc.elements.element.LayoutElementGroup(bbox: Bbox | None = None, items: List[LayoutElement] | None = None, title: str = 'LayoutElementGroup')

Base class for any coherent group of layout objects within the PDF. The Bbox of the LayoutElementGroup is the rectangle encompassing the Bboxes of all its members.

append(item: LayoutElement, update_bbox: bool = True)

Add an item to the group

Parameters:
  • item (LayoutElement) – Item to add

  • update_bbox (bool, optional) – Should the group Bbox be recalculated or left unchanged? Leaving it unchanged is useful when items are non-contiguous (e.g. they cross columns or pages). Defaults to True.
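The effect of update_bbox can be pictured with a small sketch. SketchGroup and enclosing_rect are hypothetical stand-ins that use plain (x0, y0, x1, y1) tuples instead of Burdoc's classes, and the recalculation rule (minimum and maximum over all members) is an assumption based on the class description.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def enclosing_rect(rects: List[Rect]) -> Rect:
    """Smallest rectangle containing every rectangle in rects."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


class SketchGroup:
    """Hypothetical stand-in for LayoutElementGroup (illustration only)."""

    def __init__(self) -> None:
        self.items: List[Rect] = []
        self.bbox: Optional[Rect] = None

    def append(self, item: Rect, update_bbox: bool = True) -> None:
        self.items.append(item)
        if update_bbox:
            # Recalculate the group bbox so that it spans all members.
            self.bbox = enclosing_rect(self.items)


group = SketchGroup()
group.append((50.0, 100.0, 250.0, 120.0))
group.append((50.0, 125.0, 240.0, 145.0))
bbox_before = group.bbox

# A continuation in the next column: keep the item, leave the bbox alone.
group.append((300.0, 100.0, 500.0, 120.0), update_bbox=False)
print(bbox_before, group.bbox, len(group.items))
```

The third item joins the group, but the group rectangle still covers only the first column.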

merge(leg: LayoutElementGroup) LayoutElementGroup

In-place merge with another LayoutElementGroup

Parameters:

leg (LayoutElementGroup) – LEG to merge with

Returns:

A reference to self

Return type:

LayoutElementGroup

remove(item: LayoutElement, update_bbox: bool = True)

Remove an item from the group

Parameters:
  • item (LayoutElement) – Item to remove

  • update_bbox (bool, optional) – Should the group Bbox be recalculated or ignored? Defaults to True.

Raises:

ValueError – Item not present in list

to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False) Dict[str, Any]

Creates JSON object from LayoutElementGroup

Example JSON:

{
    "name": "LayoutElementGroup",
    "bbox": {...} [optional]
    "items": [{...}]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the group

Return type:

Dict[str, Any]

Aside

class burdoc.elements.aside.Aside(bbox: Bbox | None = None, items: List[LayoutElement] | None = None)

A small delimited section of text that is separate from the surrounding flow.

Bbox

class burdoc.elements.bbox.Bbox(x0: float, y0: float, x1: float, y1: float, page_width: float, page_height: float)

Utility class for storing and manipulating bounding boxes.

area() float

Returns total area of bbox

Returns:

float

area_norm() float

Returns total area of bbox as a percentage of the page area.

Returns:

float

center(norm: bool = False) Point

Returns a point representing the center of the bounding box.

Parameters:
  • norm (bool, optional) – Return page-normalised co-ordinates. Defaults to False.

Returns:

Point
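A minimal sketch of what the norm flag could mean, assuming that normalising divides each co-ordinate by the matching page dimension. The free function below is hypothetical and returns a plain tuple rather than a Point.

```python
from typing import Tuple


def center(x0: float, y0: float, x1: float, y1: float,
           page_width: float, page_height: float,
           norm: bool = False) -> Tuple[float, float]:
    """Centre of a box, optionally as a fraction of the page size."""
    cx = (x0 + x1) / 2
    cy = (y0 + y1) / 2
    if norm:
        # Assumption: normalising divides by the page dimensions.
        return (cx / page_width, cy / page_height)
    return (cx, cy)


# A box on a 600 x 800 page (made-up numbers).
print(center(100.0, 200.0, 300.0, 400.0, 600.0, 800.0))
print(center(100.0, 200.0, 300.0, 400.0, 600.0, 800.0, norm=True))
```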

clone() Bbox

Returns a clone of the bounding box

Returns:

Bbox

static from_points(p1: Point, p2: Point, page_width: float, page_height: float) Bbox

Create a Bbox spanning two points.

Parameters:
  • p1 (Point) – Top-left corner

  • p2 (Point) – Bottom-right corner

  • page_width (float) – Width of page

  • page_height (float) – Height of page

Returns:

Bbox

height(norm: bool = False) float

Returns the height of the bounding box.

Parameters:
  • norm (bool, optional) – Return the page-normalised height. Defaults to False.

Returns:

float

is_vertical() bool

Test whether the bounding box is oriented vertically, i.e. whether its height is greater than its width.

Returns:

True if the height is greater than the width.

Return type:

bool

static merge(bboxes: List[Bbox]) Bbox

Merge several Bboxes into a single bbox spanning all of them

Parameters:

bboxes (List[Bbox]) – Bboxes to merge

Raises:

ValueError – No bboxes passed

Returns:

Bbox

overlap(other_bbox: Bbox, normalisation: str = '') float

Calculates the overall overlap between this Bbox and another. Several normalisation options are provided:

  • “”: No normalisation

  • “first”: Return as percent of calling bounding box area

  • “second”: Return as percent of passed bounding box area

  • “min”: Return as percent of smallest area

  • “max”: Return as percent of largest area

  • “page”: Return as percent of page area

Parameters:
  • other_bbox (Bbox) – Passed bbox

  • normalisation (str, optional) – Normalisation option. Defaults to “”.

Returns:

float
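The normalisation options can be sketched with plain (x0, y0, x1, y1) tuples. The function below is a hypothetical stand-in, not Burdoc's implementation; page_area is an invented parameter, and returning a fraction rather than a value scaled to 100 is an assumption.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(r: Rect) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap(first: Rect, second: Rect, normalisation: str = "",
            page_area: float = 1.0) -> float:
    """Intersection area of two rectangles, optionally normalised."""
    width = min(first[2], second[2]) - max(first[0], second[0])
    height = min(first[3], second[3]) - max(first[1], second[1])
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    raw = width * height
    denominators = {
        "": 1.0,
        "first": area(first),
        "second": area(second),
        "min": min(area(first), area(second)),
        "max": max(area(first), area(second)),
        "page": page_area,
    }
    if normalisation not in denominators:
        raise ValueError(f"Unknown normalisation: {normalisation!r}")
    return raw / denominators[normalisation]


a = (0.0, 0.0, 10.0, 10.0)   # area 100
b = (5.0, 5.0, 25.0, 15.0)   # area 200, shares a 5 x 5 corner with a
print(overlap(a, b))            # raw intersection area
print(overlap(a, b, "first"))   # relative to a
print(overlap(a, b, "max"))     # relative to the larger box
```

Choosing “min” makes the result approach 1 whenever the smaller box sits inside the larger one, which is handy for containment checks.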

to_json(include_page=False) Dict[str, float]

Convert a Bbox to JSON format.

Example:

{
    'x0':float,
    'y0':float,
    'x1':float,
    'y1':float,
    'pw':float [optional],
    'ph':float [optional]
}
Parameters:

include_page (bool, optional) – Include page width and height. Defaults to False.

Returns:

Dict[str, float]

to_rect() List[float]

Returns a four-value representation of the box.

Returns:

[x0, y0, x1, y1]

Return type:

List[float]

width(norm: bool = False) float

Returns the width of the bounding box.

Parameters:
  • norm (bool, optional) – Return the page-normalised width. Defaults to False.

Returns:

float

x0_norm() float

x0 normalised by its position on the page

Returns:

float

x1_norm() float

x1 normalised by its position on the page

Returns:

float

x_distance(other_bbox: Bbox) float

Returns the distance between this Bbox and the passed Bbox in the x direction. Note that this is calculated centre to centre. The result is negative if the passed Bbox is to the right of this Bbox.

Parameters:

other_bbox (Bbox) – Passed bbox

Returns:

float

x_overlap(other_bbox: Bbox, normalisation: str = '') float

Calculates the projected overlap between this Bbox and another in the x axis. Several normalisation options are provided:

  • “”: No normalisation

  • “first”: Return as percent of calling bounding box width

  • “second”: Return as percent of passed bounding box width

  • “min”: Return as percent of thinnest box width

  • “max”: Return as percent of widest box width

  • “page”: Return as percent of page width

Parameters:
  • other_bbox (Bbox) – Passed bbox

  • normalisation (str, optional) – Normalisation option. Defaults to “”.

Returns:

float
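A projected overlap ignores the other axis entirely: two boxes far apart vertically can still overlap fully in x. The unnormalised case reduces to an interval intersection, sketched here with a hypothetical free function that takes raw extents rather than Bbox objects.

```python
def x_overlap(a_x0: float, a_x1: float, b_x0: float, b_x1: float) -> float:
    """Length shared by two horizontal extents, or 0.0 if they are disjoint."""
    return max(0.0, min(a_x1, b_x1) - max(a_x0, b_x0))


# Two text columns that do not touch, and a heading spanning both.
left_column = (50.0, 280.0)
right_column = (320.0, 550.0)
heading = (50.0, 550.0)

print(x_overlap(*left_column, *right_column))  # disjoint columns
print(x_overlap(*left_column, *heading))       # heading covers the column
```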

y0_norm() float

y0 normalised by its position on the page

Returns:

float

y1_norm() float

y1 normalised by its position on the page

Returns:

float

y_distance(other_bbox: Bbox) float

Returns the distance between this Bbox and the passed Bbox in the y direction. Note that this is calculated centre to centre. The result is negative if the passed Bbox is below this Bbox.

Parameters:

other_bbox (Bbox) – Passed bbox

Returns:

float

y_overlap(other_bbox: Bbox, normalisation: str = '') float

Calculates the projected overlap between this Bbox and another in the y axis. Several normalisation options are provided:

  • “”: No normalisation

  • “first”: Return as percent of calling bounding box height

  • “second”: Return as percent of passed bounding box height

  • “min”: Return as percent of shortest box height

  • “max”: Return as percent of tallest box height

  • “page”: Return as percent of page height

Parameters:
  • other_bbox (Bbox) – Passed bbox

  • normalisation (str, optional) – Normalisation option. Defaults to “”.

Returns:

float

class burdoc.elements.bbox.Point(x: 'float', y: 'float')

DrawingElement

class burdoc.elements.drawing.DrawingElement(bbox: Bbox, drawing_type: DrawingType = DrawingType.UNKNOWN, fill_opacity: float = 0.0, fill_colour: ndarray | None = None, stroke_opacity: float = 0.0, stroke_colour: ndarray | None = None, stroke_width: float | None = None)

Core element representing a drawing

__init__(bbox: Bbox, drawing_type: DrawingType = DrawingType.UNKNOWN, fill_opacity: float = 0.0, fill_colour: ndarray | None = None, stroke_opacity: float = 0.0, stroke_colour: ndarray | None = None, stroke_width: float | None = None)

Creates a drawing element.

Parameters:
  • bbox (Bbox) – Bbox of the extent of the drawing

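  • fill_opacity (float, optional) – Opacity of the drawing’s fill. Defaults to 0.0.

  • stroke_opacity (float, optional) – Opacity of the drawing’s stroke. Defaults to 0.0.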

  • drawing_type (DrawingType, optional) – Semantic purpose of the drawing. Default is UNKNOWN

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

enum burdoc.elements.drawing.DrawingType(value)

Enumeration of types of drawing Burdoc understands.

  • LINE: Anything long and thin used as a visual separator

  • RECT: Usually means a square or outer edge defining an aside or section

  • TABLE: A collection of rectangles in a common table pattern

  • BULLET: A small circle indicating a textual bullet point.

  • UNKNOWN: An unknown drawing type

Valid values are as follows:

LINE = <DrawingType.LINE: 1>
RECT = <DrawingType.RECT: 2>
BULLET = <DrawingType.BULLET: 3>
TABLE = <DrawingType.TABLE: 4>
UNKNOWN = <DrawingType.UNKNOWN: 5>

Font

class burdoc.elements.font.Font(name: str, family: str, size: float, colour: int, bold: bool, italic: bool, superscript: bool, smallcaps: bool)

Representation of font information

static from_dict(span_dict: Dict[str, Any])

Creates Font object from a PyMuPDF span

Some properties are inferred based on PyMuPDF flags, others are set dynamically from the font name

Parameters:

span_dict (Dict[str, Any]) – The PyMuPDF span dictionary
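For orientation, PyMuPDF stores font characteristics of a span as bits in an integer "flags" entry. The sketch below decodes three of them with a hypothetical helper, decode_flags. Which properties Burdoc takes from the flags and which it derives from the font name is not specified here, so treat the selection as an assumption.

```python
from typing import Any, Dict

# Bit values documented by PyMuPDF for a span's "flags" entry.
SUPERSCRIPT = 1 << 0
ITALIC = 1 << 1
SERIFED = 1 << 2
MONOSPACED = 1 << 3
BOLD = 1 << 4


def decode_flags(span_dict: Dict[str, Any]) -> Dict[str, bool]:
    """Turn the integer flags of a span into named booleans."""
    flags = span_dict["flags"]
    return {
        "superscript": bool(flags & SUPERSCRIPT),
        "italic": bool(flags & ITALIC),
        "bold": bool(flags & BOLD),
    }


# Hand-written span in the shape PyMuPDF produces (values are made up).
span = {"font": "Arial-BoldItalicMT", "size": 11.0, "flags": 18,
        "color": 0, "text": "Important"}
print(decode_flags(span))
```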

static split_font_name(fontname: str, type: str = '') Tuple[str, str]

Splits a font name into its family and base name (family-variation). The optional type argument is only used when an unnamed font is found.

Consistently handles font subsetting and variations.

Parameters:
  • fontname (str) – Full name of a font

  • type (str, optional) – Font type. Defaults to “”.

Returns:

(font family, font basename)

Return type:

Tuple[str, str]
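To make the subsetting point concrete: embedded subset fonts in a PDF are named with a six-letter uppercase tag and a plus sign in front of the real name. The function below is a simplified, hypothetical sketch of the split, assuming the variation follows the first hyphen. The real method may handle more naming schemes, and the type argument is omitted here.

```python
import re
from typing import Tuple

# PDF subset fonts carry a tag of six uppercase letters and a plus sign.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def split_font_name_sketch(fontname: str) -> Tuple[str, str]:
    """Return (family, basename) for a font name (illustration only)."""
    basename = SUBSET_TAG.sub("", fontname)
    # Assumption: the variation follows the first hyphen, if any.
    family = basename.split("-", 1)[0]
    return family, basename


print(split_font_name_sketch("ABCDEF+Calibri-Bold"))
print(split_font_name_sketch("Calibri-Bold"))
print(split_font_name_sketch("Helvetica"))
```

Stripping the tag first means that a subset font and its full counterpart resolve to the same family and base name.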

to_json()

Convert the Font into a JSON object

Returns:

A JSON representation of the font.

Return type:

Dict[str, Any]

ImageElement

class burdoc.elements.image.ImageElement(bbox: Bbox, original_bbox: Bbox, image: int, properties: Dict[str, Any], image_type: ImageType = ImageType.UNKNOWN, inline: bool = False)

Core element representing an image within a page layout

__init__(bbox: Bbox, original_bbox: Bbox, image: int, properties: Dict[str, Any], image_type: ImageType = ImageType.UNKNOWN, inline: bool = False)

Create an image element.

Parameters:
  • bbox (Bbox) – A Bbox representing the image’s visible extent

  • original_bbox (Bbox) – A Bbox representing the image’s true extent

  • image (int) – Index of page image store where image is found

  • properties (Dict[str, Any]) – Any additional properties of the image

  • image_type (ImageType, optional) – Purpose of the image. Default is UNKNOWN

  • inline (bool, optional) – Whether the image layout should be inline or additional. Generally set later in processing. Defaults to False.

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

enum burdoc.elements.image.ImageType(value)

Enumeration of types of images Burdoc understands.

  • INVISIBLE: Image isn’t visible on page

  • BACKGROUND: Image is used as background for the whole page

  • SECTION: Image is used as a background for a page section or aside

  • INLINE: Image is part of the flow of text (currently unused)

  • DECORATIVE: Image is a decorative element in the page layout but has no semantic meaning

  • PRIMARY: Image is a ‘hero’ image on the page

  • GRADIENT: Image is a smooth gradient used as a background

  • LINE: Image is used to semantically separate page sections

Valid values are as follows:

INVISIBLE = <ImageType.INVISIBLE: 1>
BACKGROUND = <ImageType.BACKGROUND: 2>
SECTION = <ImageType.SECTION: 3>
INLINE = <ImageType.INLINE: 4>
DECORATIVE = <ImageType.DECORATIVE: 5>
PRIMARY = <ImageType.PRIMARY: 6>
GRADIENT = <ImageType.GRADIENT: 7>
LINE = <ImageType.LINE: 8>
UNKNOWN = <ImageType.UNKNOWN: 9>

LineElement

class burdoc.elements.line.LineElement(bbox: Bbox, spans: List[Span], rotation: Tuple[float, float])

Core element representing a line of text

__init__(bbox: Bbox, spans: List[Span], rotation: Tuple[float, float])

Creates a line element

Parameters:
  • bbox (Bbox) – Bbox of the extent of the line

  • spans (List[Span]) – List of text spans within the line. Separation into spans implies a change in font.

  • rotation (Tuple[float, float]) – Degree of rotation from the x-axis

static from_dict(line_dict: Dict[str, Any], page_width: float, page_height: float) LineElement

Create a LineElement from a PyMuPDF line dictionary

Parameters:
  • line_dict (Dict[str, Any]) – The PyMuPDF dictionary

  • page_width (float) – Used to normalise bbox

  • page_height (float) – Used to normalise bbox

Returns:

LineElement

get_text() str

Returns all text contained within the line as a string. This strips out any format or font information.

Returns:

str
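A small sketch of the idea, working directly on a PyMuPDF-style line dictionary rather than on a LineElement. The helper line_text is hypothetical, and joining span texts without a separator is an assumption.

```python
from typing import Any, Dict


def line_text(line_dict: Dict[str, Any]) -> str:
    """Concatenate the text of every span in a PyMuPDF-style line."""
    return "".join(span["text"] for span in line_dict["spans"])


# Hand-written line in the shape PyMuPDF produces (values are made up).
line = {
    "bbox": (72.0, 100.0, 300.0, 112.0),
    "dir": (1.0, 0.0),
    "spans": [
        {"text": "Burdoc reads ", "font": "Calibri", "flags": 0},
        {"text": "PDF", "font": "Calibri-Bold", "flags": 16},
        {"text": " files.", "font": "Calibri", "flags": 0},
    ],
}
print(line_text(line))
```

The bold span in the middle contributes only its characters; the font change that caused the split is dropped.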

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

PageSection

class burdoc.elements.section.PageSection(bbox: Bbox | None = None, items: List[LayoutElement] | None = None, default: bool = False, backing_drawing: DrawingElement | None = None, backing_image: ImageElement | None = None, inline: bool = False)

A fully contained section of the page on which layout analysis should be done independently.

__init__(bbox: Bbox | None = None, items: List[LayoutElement] | None = None, default: bool = False, backing_drawing: DrawingElement | None = None, backing_image: ImageElement | None = None, inline: bool = False)

Create a PageSection. One of bbox or items must be provided

Parameters:
  • bbox (Optional[Bbox], optional) – BBox of the section. Defaults to None.

  • items (Optional[List[LayoutElement]], optional) – Items contained within the section. Defaults to None.

  • default (bool, optional) – Is this part of the underlying page or a subsection. Defaults to False.

  • backing_drawing (Optional[Any], optional) – Drawing used as the background for this section only. Defaults to None.

  • backing_image (Optional[Any], optional) – Image used as the background for this section only. Defaults to None.

  • inline (bool, optional) – Is this section inline with surrounding text. Usually inferred later in the pipeline. Defaults to False.

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Creates JSON object from LayoutElementGroup

Example JSON:

{
    "name": "LayoutElementGroup",
    "bbox": {...} [optional]
    "items": [{...}]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the group

Return type:

Dict[str, Any]

Span

class burdoc.elements.span.Span(bbox: Bbox, text: str, font: Font)

Representation of a continuous run of text with the same font information.

static from_dict(span_dict: Dict[str, Any], page_width: float, page_height: float)

Creates a Span from a PyMuPDF span dictionary

Parameters:
  • span_dict (Dict[str, Any]) – The PyMuPDF span dictionary

  • page_width (float) – Used to normalise bbox

  • page_height (float) – Used to normalise bbox

Returns:

Span

to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

Table

class burdoc.elements.table.Table(bbox: Bbox, row_boxes: List[Tuple[TableParts, Bbox]], col_boxes: List[Tuple[TableParts, Bbox]], merge_boxes: List[Tuple[TableParts, Bbox]])

Representation of a table within the text.

__init__(bbox: Bbox, row_boxes: List[Tuple[TableParts, Bbox]], col_boxes: List[Tuple[TableParts, Bbox]], merge_boxes: List[Tuple[TableParts, Bbox]])

Creates a Table element

Parameters:
  • bbox (Bbox) – Bounding box of the table

  • row_boxes (List[Tuple[TableParts, Bbox]]) – Bounding box and descriptor of each row - use TableParts.COLUMNHEADER to indicate a row used as a header

  • col_boxes (List[Tuple[TableParts, Bbox]]) – Bounding box and descriptor of each column - use TableParts.ROWHEADER to indicate a column used as a header
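One plausible way for a consumer to derive cell rectangles from row and column boxes is to take the horizontal extent from the column and the vertical extent from the row. The sketch below is hypothetical: it uses plain (x0, y0, x1, y1) tuples, ignores the TableParts descriptors and merge_boxes, and is not taken from Burdoc.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def cell_grid(row_boxes: List[Rect], col_boxes: List[Rect]) -> List[List[Rect]]:
    """Build one rectangle per (row, column) pair."""
    return [
        [(col[0], row[1], col[2], row[3]) for col in col_boxes]
        for row in row_boxes
    ]


rows = [(50.0, 100.0, 350.0, 120.0), (50.0, 120.0, 350.0, 140.0)]
cols = [(50.0, 100.0, 150.0, 140.0), (150.0, 100.0, 350.0, 140.0)]
grid = cell_grid(rows, cols)
print(grid[1][0])  # second row, first column
```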

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Convert the object into a JSON object

Example JSON:

{
    "name": "LayoutElement",
    "bbox": {...} [optional]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that should be included within the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the object.

Return type:

Dict[str, Any]

enum burdoc.elements.table.TableParts(value)

Enum defining the different parts of a table that can be extracted

Valid values are as follows:

TABLE = <TableParts.TABLE: 0>
COLUMN = <TableParts.COLUMN: 1>
ROW = <TableParts.ROW: 2>
COLUMNHEADER = <TableParts.COLUMNHEADER: 3>
ROWHEADER = <TableParts.ROWHEADER: 4>
SPANNINGCELL = <TableParts.SPANNINGCELL: 5>

TextBlock

class burdoc.elements.textblock.TextBlock(bbox: Bbox | None = None, items: List[LineElement] | None = None, text_type: TextBlockType = TextBlockType.PARAGRAPH)

Represents a standard grouping of lines into a paragraph. All text within a textblock can be considered to be of semantically equivalent fonts. This may include variations in bold or italics.

get_text() str

Returns all text contained within the block as a string. This strips out any format or font information.

Returns:

str

to_json(extras: Dict[str, Any] | None = None, include_bbox: bool = False, **kwargs)

Convert the textblock into a JSON object

Parameters:
  • extras (Dict[str, Any], optional) – Any additional fields that should be included. Defaults to None

  • include_bbox (bool, optional) – Defaults to False.

  • **kwargs – Arbitrary keyword arguments to be passed to the superclass

Returns:

Dict[str, Any]

enum burdoc.elements.textblock.TextBlockType(value)

Possible types of text supported by the semantic classifier.

Valid values are as follows:

SMALL = <TextBlockType.SMALL: 1>
PARAGRAPH = <TextBlockType.PARAGRAPH: 2>
H1 = <TextBlockType.H1: 3>
H2 = <TextBlockType.H2: 4>
H3 = <TextBlockType.H3: 5>
H4 = <TextBlockType.H4: 6>
H5 = <TextBlockType.H5: 7>
H6 = <TextBlockType.H6: 8>
EMPHASIS = <TextBlockType.EMPHASIS: 9>
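As an example of consuming these values, a renderer might map each block type to an HTML tag. The enum below is a local copy of the documented members so the sketch runs on its own, and the tag mapping is hypothetical rather than part of Burdoc.

```python
from enum import Enum


class SketchTextBlockType(Enum):
    """Local copy of the documented values (illustration only)."""
    SMALL = 1
    PARAGRAPH = 2
    H1 = 3
    H2 = 4
    H3 = 5
    H4 = 6
    H5 = 7
    H6 = 8
    EMPHASIS = 9


def html_tag(text_type: SketchTextBlockType) -> str:
    """Pick an HTML tag for a block type (hypothetical mapping)."""
    if text_type.name.startswith("H"):
        return text_type.name.lower()  # H1 -> "h1", and so on
    return {
        SketchTextBlockType.SMALL: "small",
        SketchTextBlockType.PARAGRAPH: "p",
        SketchTextBlockType.EMPHASIS: "em",
    }[text_type]


print(html_tag(SketchTextBlockType.H2))
print(html_tag(SketchTextBlockType.PARAGRAPH))
```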

TextList

class burdoc.elements.textlist.TextList(ordered: bool, bbox: Bbox | None = None, items: List[LayoutElement] | None = None)

An ordered or unordered list

__init__(ordered: bool, bbox: Bbox | None = None, items: List[LayoutElement] | None = None)

Create a text list. Must provide one of bbox or items. If items are provided the bbox will be inferred.

Parameters:
  • ordered (bool) – Is the list ordered (alphanumeric) or unordered (bullets)

  • bbox (Optional[Bbox], optional) – Bbox containing the list. Defaults to None.

  • items (Optional[List[LayoutElement]], optional) – Items making up the list. Defaults to None.

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Creates JSON object from LayoutElementGroup

Example JSON:

{
    "name": "LayoutElementGroup",
    "bbox": {...} [optional]
    "items": [{...}]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the group

Return type:

Dict[str, Any]

class burdoc.elements.textlist.TextListItem(label: str, items: List[TextBlock])

A single item within a list. Equivalent to <li>

__init__(label: str, items: List[TextBlock])

Create a text list item

Parameters:
  • label (str) – The label of the list item. Can be bullet or alphanumeric

  • items (List[TextBlock]) – The content of the list item

to_json(extras: Dict | None = None, include_bbox: bool = False, **kwargs)

Creates JSON object from LayoutElementGroup

Example JSON:

{
    "name": "LayoutElementGroup",
    "bbox": {...} [optional]
    "items": [{...}]
}
Parameters:
  • extras (Optional[Dict[str, Any]], optional) – Any additional items that need to be included in the JSON. Defaults to None.

  • include_bbox (bool, optional) – Include the bounding box. Defaults to False.

Returns:

A JSON representation of the group

Return type:

Dict[str, Any]