# Working with Burdoc Output ```{toctree} :maxdepth: 4 ``` By default, Burdoc returns a JSON dictionary containing the content extracted from the PDF. It will always contain 'metadata', 'content', and 'page_hierarchy' and may optionally contain extracted images and rendered page images ```{contents} :depth: 4 :local: :backlinks: top ``` ## Metadata Any file and content metadata produced during the extraction process | Field | Type | Description |-|-|-| | path | str | Path to the original file | | title | str | Title of the document, if available in the PDF, otherwise is the file name | | pdf_metadata | object | Metadata extracted by PyMuPDF | | toc | list | Table of content if stored programmatically within the pdf, otherwise [] | Example Output: ```python { "metadata": { "path": "/path/to/file.pdf" "title": "file.pdf" "pdf_metadata": { "format": "PDF 1.7", "title": "", "author": "Author", "subject": "", "keywords": "", "creator": "Creator", "producer": "Producer", "creationDate": "D:20230325084930+00'00'", "modDate": "D:20230325084930+00'00'", "trapped": "", "encryption": null }, "toc": [] }, } ``` ___ ## Content The content field provides all extracted text, images, and tables indexed by page number and ordered in inferred reading order. | Field | Type | Description | |-|-|-| | Page Index | List[object] | Page number of following content item. Zero-indexed | Example: ```python { "content": { "0": [ #page number { "name": "textblock", #content form extracted "type": "h2", #subtype of content "items": [ #subitems within the content { "name": "line" "spans": [ { "name": "span", "text": "A Test Document ", "font": { "name": "font", "font": "TimesNewRomanPSMT", "family": "TimesNewRomanPSMT", "size": 20.0, "colour": 0, "bold": false, "italic": false, "superscript": false } #/font } #/span ], #/spans } #/line ], #/items }, #/textblock ] #/end page content } #/end content } ``` ___ ### Content Types #### Aside HTML: ```
``` An aside is a block of content visually separated from the main page, usually via a backing image/fill or by a boxed outline. Asides act as a basic container with any other type of content inside. | Field | Type | Description | |-|-|-| | name | str | "aside" | | items | list[Content Object] | List of content items, cannot contain another aside | Example: ```python { 'name': 'aside', 'items': [ { "name": "textblock", #content form extracted "type": "h2", #subtype of content "items": [ #subitems within the content { "name": "line" "spans": [ { "name": "span", "text": "A Test Document ", "font": { "name": "font", "font": "TimesNewRomanPSMT", "family": "TimesNewRomanPSMT", "size": 20.0, "colour": 0, "bold": false, "italic": false, "superscript": false } #/font } #/span ], #/spans } #/line ], #/items }, #/textblock ] } ``` #### Font This expresses the font information of text | Field | Type | Description | |-|-|-| | name | str | "font" | | font | str | Name of the font | | family | str | Name of the inferred font family | | size | float | Font size in pt | | colour | int | Font colour | | bold | bool | True if text is bold | | italic | bool | True if text is italic | | superscript | bool | True if text is superscript | | smallcaps | bool | True if font is a smallcaps font | #### Image HTML: `````` The fact of an image being present is always extracted, whether or not the images themselves are stored. | Field | Type | Description | |-|-|-| | name | str | "image" | | image_type | str | Category of the image, usually ['primary'] for extracted images | image | int | Index of image within extracted image list | #### Line HTML: None A line is a single line of text, which may contain multiple spans with differing font information. The end of a line indicates a line break in the original text, not the end of a semantic sentence. | Field | Type | Description | |-|-|-| | name | str | "line" | | spans | list[Span] | List of text spans | #### Span HTML: `````` A span is a grouping of test within a line based on font information | Field | Type | Description | |-|-|-| | name | str | "span" | | text | str | The extracted text | | font | Font | Font information for the span | #### Table HTML: `````` Representation of any tables extracted from the document. | Field | Type | Description | |-|-|-| | name | str | "table" | | cells | list[list[list[TextBlock]]] | Extracted cells in row-column-cell nesting. Note cells can contain multiple text blocks. | | row_header_index | list[int] | Indexes of any columns that should be treated as row headers. | | col_header_index | list[int] | Indexes of any rows that should be treated as column headers. | #### TextBlock HTML: ```

,``` A set of lines that has been inferred to be part of the same grouping. Usually represents a paragraph. | Field | Type | Description | |-|-|-| | name | str | "textblock" | | type | str | Inferred interpretation of the text. One of ['paragraph', 'h[1-5]', 'emphasis', 'small'] | | items | list[Line] | All lines contained within the block. | | block_text | str | A basic representation of all text within the block with all font information removed. | #### TextList HTML: ```