Working with Burdoc Output
By default, Burdoc returns a JSON dictionary containing the content extracted from the PDF. It will always contain ‘metadata’, ‘content’, and ‘page_hierarchy’ and may optionally contain extracted images and rendered page images
Metadata
Any file and content metadata produced during the extraction process
Field |
Type |
Description |
|---|---|---|
path |
str |
Path to the original file |
title |
str |
Title of the document, if available in the PDF, otherwise is the file name |
pdf_metadata |
object |
Metadata extracted by PyMuPDF |
toc |
list |
Table of content if stored programmatically within the pdf, otherwise [] |
Example Output:
{
"metadata": {
"path": "/path/to/file.pdf"
"title": "file.pdf"
"pdf_metadata": {
"format": "PDF 1.7",
"title": "",
"author": "Author",
"subject": "",
"keywords": "",
"creator": "Creator",
"producer": "Producer",
"creationDate": "D:20230325084930+00'00'",
"modDate": "D:20230325084930+00'00'",
"trapped": "",
"encryption": null
},
"toc": []
},
}
Content
The content field provides all extracted text, images, and tables indexed by page number and ordered in inferred reading order.
Field |
Type |
Description |
|---|---|---|
Page Index |
List[object] |
Page number of following content item. Zero-indexed |
Example:
{
"content": {
"0": [ #page number
{
"name": "textblock", #content form extracted
"type": "h2", #subtype of content
"items": [ #subitems within the content
{
"name": "line"
"spans": [
{
"name": "span",
"text": "A Test Document ",
"font": {
"name": "font",
"font": "TimesNewRomanPSMT",
"family": "TimesNewRomanPSMT",
"size": 20.0,
"colour": 0,
"bold": false,
"italic": false,
"superscript": false
} #/font
} #/span
], #/spans
} #/line
], #/items
}, #/textblock
] #/end page content
} #/end content
}
Content Types
Aside
HTML: <div>
An aside is a block of content visually separated from the main page, usually via a backing image/fill or by a boxed outline. Asides act as a basic container with any other type of content inside.
Field |
Type |
Description |
|---|---|---|
name |
str |
“aside” |
items |
list[Content Object] |
List of content items, cannot contain another aside |
Example:
{
'name': 'aside',
'items': [
{
"name": "textblock", #content form extracted
"type": "h2", #subtype of content
"items": [ #subitems within the content
{
"name": "line"
"spans": [
{
"name": "span",
"text": "A Test Document ",
"font": {
"name": "font",
"font": "TimesNewRomanPSMT",
"family": "TimesNewRomanPSMT",
"size": 20.0,
"colour": 0,
"bold": false,
"italic": false,
"superscript": false
} #/font
} #/span
], #/spans
} #/line
], #/items
}, #/textblock
]
}
Font
This expresses the font information of text
Field |
Type |
Description |
|---|---|---|
name |
str |
“font” |
font |
str |
Name of the font |
family |
str |
Name of the inferred font family |
size |
float |
Font size in pt |
colour |
int |
Font colour |
bold |
bool |
True if text is bold |
italic |
bool |
True if text is italic |
superscript |
bool |
True if text is superscript |
smallcaps |
bool |
True if font is a smallcaps font |
Image
HTML: <image>
The fact of an image being present is always extracted, whether or not the images themselves are stored.
Field |
Type |
Description |
|---|---|---|
name |
str |
“image” |
image_type |
str |
Category of the image, usually [‘primary’] for extracted images |
image |
int |
Index of image within extracted image list |
Line
HTML: None
A line is a single line of text, which may contain multiple spans with differing font information. The end of a line indicates a line break in the original text, not the end of a semantic sentence.
Field |
Type |
Description |
|---|---|---|
name |
str |
“line” |
spans |
list[Span] |
List of text spans |
Span
HTML: <span>
A span is a grouping of test within a line based on font information
Field |
Type |
Description |
|---|---|---|
name |
str |
“span” |
text |
str |
The extracted text |
font |
Font |
Font information for the span |
Table
HTML: <table>
Representation of any tables extracted from the document.
Field |
Type |
Description |
|---|---|---|
name |
str |
“table” |
cells |
list[list[list[TextBlock]]] |
Extracted cells in row-column-cell nesting. Note cells can contain multiple text blocks. |
row_header_index |
list[int] |
Indexes of any columns that should be treated as row headers. |
col_header_index |
list[int] |
Indexes of any rows that should be treated as column headers. |
TextBlock
HTML: <p>,<h[1-5]>
A set of lines that has been inferred to be part of the same grouping. Usually represents a paragraph.
Field |
Type |
Description |
|---|---|---|
name |
str |
“textblock” |
type |
str |
Inferred interpretation of the text. One of [‘paragraph’, ‘h[1-5]’, ‘emphasis’, ‘small’] |
items |
list[Line] |
All lines contained within the block. |
block_text |
str |
A basic representation of all text within the block with all font information removed. |
TextList
HTML: <ul>,<ol>
An ordered or unordered list
Field |
Type |
Description |
|---|---|---|
name |
str |
“textlist” |
ordered |
bool |
Whether the list is ordered or unordered |
items |
list[TextListItem] |
All items contained within the list |
TextListItem
Field |
Type |
Description |
|---|---|---|
name |
str |
“textlistitem” |
label |
str |
The extract label, can be one of [\u2022, (a), a), a., (1), 1), 1.] |
items |
list[TextBlock] |
All paragraphs contained within this list item. |
Page Hierarchy
The page hierarchy is an inferred table of contents based on headers found within the text. It is indexed by page, similarly to the ‘content’ field.
Field |
Type |
Description |
|---|---|---|
page |
int |
Page index |
index |
Tuple[int, int or None] |
The item and sub-item index. Sub-item index is none if the element is not within an aside |
text |
str |
Simplified text reprentation of the heading. All font information is removed |
size |
float |
Font size in pt |
Images (--images only)
If extracted using the “–images” flag or ‘extract_images’ argument, images are stored as a page-indexed list of base64 encoded images. If the images flag is not used, ImageElements will still be present in the extracted content but they won’t contain the actual image data.
Font Statistics (--detailed only)
If ‘detailed’ extraction mode is used then font statistics for the full document will be extracted. This includes information on each font, it’s occurences, and it’s actual size on the page.