Working with Burdoc Output

By default, Burdoc returns a JSON dictionary containing the content extracted from the PDF. It always contains ‘metadata’, ‘content’, and ‘page_hierarchy’, and may optionally contain extracted images and rendered page images.

Metadata 

File and content metadata produced during the extraction process.

Field	Type	Description
path	str	Path to the original file
title	str	Title of the document if available in the PDF, otherwise the file name
pdf_metadata	object	Metadata extracted by PyMuPDF
toc	list	Table of contents if stored programmatically within the PDF, otherwise []

Example Output:

{
    "metadata": {
        "path": "/path/to/file.pdf",
        "title": "file.pdf",
        "pdf_metadata": {
            "format": "PDF 1.7",
            "title": "",
            "author": "Author",
            "subject": "",
            "keywords": "",
            "creator": "Creator",
            "producer": "Producer",
            "creationDate": "D:20230325084930+00'00'",
            "modDate": "D:20230325084930+00'00'",
            "trapped": "",
            "encryption": null
        },
        "toc": []
    }
}
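As a minimal sketch of reading these fields, the snippet below parses a trimmed copy of the example above with Python's standard `json` module. The helper name `describe_document` and the fallback wording are illustrative assumptions, not part of Burdoc.

```python
import json

# Trimmed copy of the example metadata above, used only for illustration.
raw = """
{
    "metadata": {
        "path": "/path/to/file.pdf",
        "title": "file.pdf",
        "pdf_metadata": {"title": "", "author": "Author"},
        "toc": []
    }
}
"""


def describe_document(output: dict) -> str:
    """Build a one-line summary from the 'metadata' field (hypothetical helper)."""
    meta = output["metadata"]
    # pdf_metadata values may be empty strings, so fall back explicitly.
    author = meta["pdf_metadata"].get("author") or "unknown author"
    toc_note = "has a stored TOC" if meta["toc"] else "no stored TOC"
    return f"{meta['title']} by {author} ({toc_note})"


output = json.loads(raw)
print(describe_document(output))
```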

Content 

The content field provides all extracted text, images, and tables indexed by page number and ordered in inferred reading order.

Field	Type	Description
Page Index	List[object]	Zero-indexed page number, used as the key for that page’s list of content items

Example:

{
    "content": {
        "0": [ #page number
            {
                "name": "textblock", #content form extracted
                "type": "h2", #subtype of content
                "items": [ #subitems within the content
                    {
                        "name": "line",
                        "spans": [
                            {
                                "name": "span",
                                "text": "A Test Document ",
                                "font": {
                                    "name": "font",
                                    "font": "TimesNewRomanPSMT",
                                    "family": "TimesNewRomanPSMT",
                                    "size": 20.0,
                                    "colour": 0,
                                    "bold": false,
                                    "italic": false,
                                    "superscript": false
                                } #/font
                            } #/span
                        ] #/spans
                    } #/line
                ] #/items
            } #/textblock
        ] #/end page content
    } #/end content
}
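The page-indexed layout can be walked roughly as follows. This is a sketch that assumes the page keys arrive as strings, as they do after a JSON round trip; `iter_pages` and `lines_of` are hypothetical helper names, and the sample data is abbreviated.

```python
def iter_pages(content: dict):
    """Yield (page_number, items) in numeric page order.

    JSON object keys are strings, so sort them as integers;
    a plain string sort would place "10" before "2".
    """
    for key in sorted(content, key=int):
        yield int(key), content[key]


def lines_of(block: dict) -> list:
    """Return one string per line of a textblock by joining its span text."""
    return ["".join(span["text"] for span in line["spans"]) for line in block["items"]]


# Abbreviated sample data in the shape shown above (font details omitted).
content = {
    "10": [],
    "2": [],
    "0": [
        {
            "name": "textblock",
            "type": "h2",
            "items": [
                {
                    "name": "line",
                    "spans": [
                        {"name": "span", "text": "A Test ", "font": {}},
                        {"name": "span", "text": "Document", "font": {}},
                    ],
                }
            ],
        }
    ],
}

for page, items in iter_pages(content):
    for item in items:
        if item["name"] == "textblock":
            print(page, item["type"], lines_of(item))
```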

Content Types 

Aside 

HTML: <div>

An aside is a block of content visually separated from the main page, usually via a backing image/fill or by a boxed outline. Asides act as a basic container with any other type of content inside.

Field	Type	Description
name	str	“aside”
items	list[Content Object]	List of content items; an aside cannot contain another aside

Example:

{
    "name": "aside",
    "items": [
        {
            "name": "textblock", #content form extracted
            "type": "h2", #subtype of content
            "items": [ #subitems within the content
                {
                    "name": "line",
                    "spans": [
                        {
                            "name": "span",
                            "text": "A Test Document ",
                            "font": {
                                "name": "font",
                                "font": "TimesNewRomanPSMT",
                                "family": "TimesNewRomanPSMT",
                                "size": 20.0,
                                "colour": 0,
                                "bold": false,
                                "italic": false,
                                "superscript": false
                            } #/font
                        } #/span
                    ] #/spans
                } #/line
            ] #/items
        } #/textblock
    ]
}
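Because an aside cannot contain another aside, a single pass is enough to lift its children into the surrounding page content. The sketch below shows one way to do that; `flatten_asides` is a hypothetical helper, not a Burdoc function.

```python
def flatten_asides(items: list) -> list:
    """Return page content with every aside replaced by its child items.

    Asides never nest, so no recursion is needed.
    """
    flat = []
    for item in items:
        if item.get("name") == "aside":
            flat.extend(item.get("items", []))
        else:
            flat.append(item)
    return flat


# Abbreviated sample page: one plain block followed by an aside with two blocks.
page_items = [
    {"name": "textblock", "type": "paragraph", "block_text": "Main text"},
    {
        "name": "aside",
        "items": [
            {"name": "textblock", "type": "h2", "block_text": "Boxed heading"},
            {"name": "textblock", "type": "paragraph", "block_text": "Boxed text"},
        ],
    },
]

for item in flatten_asides(page_items):
    print(item["type"], item["block_text"])
```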

Font 

Font information for a piece of text.

Field	Type	Description
name	str	“font”
font	str	Name of the font
family	str	Name of the inferred font family
size	float	Font size in pt
colour	int	Font colour
bold	bool	True if text is bold
italic	bool	True if text is italic
superscript	bool	True if text is superscript
smallcaps	bool	True if font is a smallcaps font

Image 

HTML: <image>

An image element is always recorded wherever an image is found, whether or not the image data itself is stored.

Field	Type	Description
name	str	“image”
image_type	str	Category of the image, usually [‘primary’] for extracted images
image	int	Index of image within extracted image list

Line 

HTML: None

A line is a single line of text, which may contain multiple spans with differing font information. The end of a line indicates a line break in the original text, not the end of a semantic sentence.

Field	Type	Description
name	str	“line”
spans	list[Span]	List of text spans

Span 

HTML: <span>

A span is a grouping of text within a line based on font information.

Field	Type	Description
name	str	“span”
text	str	The extracted text
font	Font	Font information for the span

Table 

HTML: <table>

Representation of any tables extracted from the document.

Field	Type	Description
name	str	“table”
cells	list[list[list[TextBlock]]]	Extracted cells in row-column-cell nesting. Note that a cell can contain multiple text blocks.
row_header_index	list[int]	Indexes of any columns that should be treated as row headers.
col_header_index	list[int]	Indexes of any rows that should be treated as column headers.
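The row-column-cell nesting can be reduced to rows of plain strings along these lines. The sketch relies on the `block_text` field of each TextBlock and on `col_header_index` naming header rows, as described in the field tables; the helper names and the sample table are illustrative assumptions.

```python
def cell_text(cell: list) -> str:
    """Join the plain text of every TextBlock in a single cell."""
    return " ".join(block.get("block_text", "") for block in cell).strip()


def table_to_rows(table: dict) -> list:
    """Convert the nested cells into a list of rows of strings."""
    return [[cell_text(cell) for cell in row] for row in table["cells"]]


def split_header(table: dict) -> tuple:
    """Separate the rows listed in col_header_index from the body rows."""
    header_rows = set(table.get("col_header_index", []))
    rows = table_to_rows(table)
    headers = [row for i, row in enumerate(rows) if i in header_rows]
    body = [row for i, row in enumerate(rows) if i not in header_rows]
    return headers, body


def tb(text: str) -> dict:
    """Build a minimal textblock for the sample data below."""
    return {"name": "textblock", "type": "paragraph", "items": [], "block_text": text}


# Hypothetical 2x2 table; the last cell holds two text blocks.
table = {
    "name": "table",
    "cells": [
        [[tb("Field")], [tb("Type")]],
        [[tb("path")], [tb("str"), tb("(required)")]],
    ],
    "row_header_index": [],
    "col_header_index": [0],
}

headers, body = split_header(table)
print(headers)
print(body)
```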

TextBlock 

HTML: <p>,<h[1-5]>

A set of lines that has been inferred to be part of the same grouping. Usually represents a paragraph.

Field	Type	Description
name	str	“textblock”
type	str	Inferred interpretation of the text. One of [‘paragraph’, ‘h[1-5]’, ‘emphasis’, ‘small’]
items	list[Line]	All lines contained within the block.
block_text	str	A basic representation of all text within the block with all font information removed.
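One way to use the `type` and `block_text` fields together is a small HTML renderer. In this sketch the h1 to h5 types map to their matching tags and every other type falls back to `<p>`; that fallback is an assumption made here for brevity, not a statement about how Burdoc itself renders ‘emphasis’ or ‘small’ blocks.

```python
import html

HEADING_TYPES = {"h1", "h2", "h3", "h4", "h5"}


def textblock_to_html(block: dict) -> str:
    """Render a textblock as a single HTML element (hypothetical helper)."""
    block_type = block.get("type", "paragraph")
    # Assumption: any non-heading type is rendered as a plain paragraph.
    tag = block_type if block_type in HEADING_TYPES else "p"
    # Escape the text so characters such as & and < stay literal.
    text = html.escape(block.get("block_text", ""))
    return f"<{tag}>{text}</{tag}>"


print(textblock_to_html({"name": "textblock", "type": "h2", "block_text": "Q&A"}))
print(textblock_to_html({"name": "textblock", "type": "paragraph", "block_text": "Body"}))
```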

TextList 

HTML: <ul>,<ol>

An ordered or unordered list.

Field	Type	Description
name	str	“textlist”
ordered	bool	Whether the list is ordered or unordered
items	list[TextListItem]	All items contained within the list

TextListItem 

Field	Type	Description
name	str	“textlistitem”
label	str	The extracted label, can be one of [\u2022, (a), a), a., (1), 1), 1.]
items	list[TextBlock]	All paragraphs contained within this list item.
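A text list can be turned back into plain text by pairing each item's extracted label with the text of its paragraphs. The sketch below assumes each paragraph carries the `block_text` field; `render_list` and the sample data are hypothetical.

```python
def render_list(text_list: dict, indent: str = "") -> str:
    """Render a textlist as one line per item, keeping the extracted labels."""
    lines = []
    for item in text_list["items"]:
        # A list item may hold several paragraphs; join them with spaces.
        text = " ".join(block.get("block_text", "") for block in item["items"])
        lines.append(f"{indent}{item['label']} {text}")
    return "\n".join(lines)


sample_list = {
    "name": "textlist",
    "ordered": True,
    "items": [
        {"name": "textlistitem", "label": "1.",
         "items": [{"name": "textblock", "type": "paragraph", "block_text": "First"}]},
        {"name": "textlistitem", "label": "2.",
         "items": [{"name": "textblock", "type": "paragraph", "block_text": "Second"},
                   {"name": "textblock", "type": "paragraph", "block_text": "continued"}]},
    ],
}

print(render_list(sample_list, indent="  "))
```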

Page Hierarchy 

The page hierarchy is an inferred table of contents based on headers found within the text. It is indexed by page, similarly to the ‘content’ field.

Field	Type	Description
page	int	Page index
index	Tuple[int, int or None]	The item and sub-item index. Sub-item index is None if the element is not within an aside
text	str	Simplified text representation of the heading. All font information is removed
size	float	Font size in pt
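The `index` pair can be used to look up the content item a heading refers to, and the `size` field gives a rough nesting level. The sketch below assumes that ‘page_hierarchy’ maps string page keys to lists of entries with the fields above, and that larger font sizes mean higher-level headings; both are assumptions, and the helper names are hypothetical.

```python
def resolve_heading(content: dict, entry: dict) -> dict:
    """Return the content item that a page hierarchy entry points at."""
    item_idx, sub_idx = entry["index"]
    item = content[str(entry["page"])][item_idx]
    if sub_idx is not None:
        # A sub-item index means the heading sits inside an aside.
        item = item["items"][sub_idx]
    return item


def build_outline(page_hierarchy: dict) -> list:
    """Return (level, page, text) tuples, with level 0 for the largest font size."""
    entries = [e for key in sorted(page_hierarchy, key=int) for e in page_hierarchy[key]]
    sizes = sorted({e["size"] for e in entries}, reverse=True)
    level_of = {size: level for level, size in enumerate(sizes)}
    return [(level_of[e["size"]], e["page"], e["text"]) for e in entries]


# Hypothetical sample data in the assumed shape.
content = {
    "0": [{"name": "textblock", "type": "h1", "block_text": "Title"}],
    "1": [{"name": "aside", "items": [
        {"name": "textblock", "type": "h2", "block_text": "Section"}]}],
}
page_hierarchy = {
    "0": [{"page": 0, "index": [0, None], "text": "Title", "size": 20.0}],
    "1": [{"page": 1, "index": [0, 0], "text": "Section", "size": 14.0}],
}

for level, page, text in build_outline(page_hierarchy):
    print("  " * level + f"{text} (page {page})")
```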

Images (`--images` only)

If extracted using the “--images” flag or ‘extract_images’ argument, images are stored as a page-indexed list of base64-encoded images. If the images flag is not used, ImageElements will still be present in the extracted content but they won’t contain the actual image data.
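Decoding a stored image might look like the sketch below. The exact layout of the image store is not spelled out here, so the code assumes an ‘images’ field that maps string page keys to lists of base64 strings, with the `image` field of an image element indexing into its page's list. Treat the field name, the key type, and the `image_bytes` helper as assumptions to check against real output.

```python
import base64
from typing import Optional


def image_bytes(output: dict, page: int, element: dict) -> Optional[bytes]:
    """Return the decoded bytes for an image element, or None if images were not stored."""
    images = output.get("images")  # assumed field name
    if not images:
        return None
    page_images = images[str(page)]  # assumed: string page keys, one list per page
    return base64.b64decode(page_images[element["image"]])


# Hypothetical sample: a single fake image on page 0.
sample_output = {"images": {"0": [base64.b64encode(b"fake-image-bytes").decode("ascii")]}}
sample_element = {"name": "image", "image_type": "primary", "image": 0}

data = image_bytes(sample_output, 0, sample_element)
print(len(data) if data is not None else "no image data stored")
```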

Font Statistics (`--detailed` only)

If ‘detailed’ extraction mode is used, font statistics for the full document will be extracted. This includes information on each font, its occurrences, and its actual size on the page.