Converting PDFs to Structured JSON

PDF files are commonly used for storing and sharing documents, but extracting data from them can be a challenging task. The PDF-GPT4-JSON project aims to simplify this process by leveraging the power of GPT-4 Vision, a state-of-the-art language model, to convert PDF files into structured JSON format. In this article, we will explore the theory behind this conversion process and discuss how it can be applied in real-world scenarios.

The Challenge of OCR Data in PDFs

One of the main challenges in extracting data from PDFs is the accuracy of the OCR (Optical Character Recognition) process. OCR is used to convert scanned or image-based PDFs into searchable and editable text. However, OCR data can often contain inaccuracies and garbage characters, especially in complex layouts or low-quality scans. This can result in errors and inconsistencies in the extracted text.

To address this challenge, the PDF-GPT4-JSON cli uses GPT-4 Vision, which has been fine-tuned for image understanding and analysis. By leveraging deep learning techniques, GPT-4 Vision can effectively analyze the layout of the text in PDFs and infer the hierarchical structure of the data. This helps to mitigate the impact of inaccurate OCR data and generate more accurate and structured JSON output.

Generating Structured JSON with GPT-4 Vision

The process of generating structured JSON using GPT-4 Vision involves several steps:

PDF Parsing: The PDF file is parsed to extract the textual content and layout information of each page. This includes identifying the position, size, and formatting of the text elements.
Text Extraction: The extracted text is processed to remove noise and irrelevant information, such as headers, footers, and page numbers. This helps to focus on the main content of the PDF.
Layout Analysis: GPT-4 Vision analyzes the layout of the text on each page to identify the hierarchical structure of the data. It looks for patterns, indentation, and formatting cues to infer the relationships between different elements. For example, it can identify headings, subheadings, lists, and tables.
JSON Generation: Based on the layout analysis, GPT-4 Vision generates a structured JSON representation of the PDF content. Each page is represented as a separate JSON file, with nested objects and arrays to capture the hierarchical relationships. This allows for easy navigation and extraction of specific data elements.

Installation and Usage

This aritcles assumes you have installed Python 3.10 or greater.

To use the PDF-GPT4-JSON cli, you need to install it via pip :

pip install pdf_gpt4_json

You also need to set your OpenAI API key by either exporting it as an environment variable or passing it as a command-line argument to the tool.

Once installed, you can run the conversion script by providing the path to the PDF file:

pdf-gpt4-json ./sample.pdf

This will generate a temporary working folder and an output folder with JSON files for each page of the PDF. The output folder will be named after the PDF file, with the prefix "samplepdf_final_folders" in this case.

The project also provides additional parameters that can be adjusted to customize the conversion process. These parameters include the path to a prompt file, the OpenAI API key, the model to use, verbosity level, and whether to clean up temporary files after processing.

Based on the first page of our sample.pdf [ original document from propublica ]a IRS 990 tax form (this is a public document that non profits must file, company Employer Identification Number (EIN) is public so it was not redacted.) it can output the following json:

{
    "Form": "990-PF",
    "Return of Private Foundation": {
        "Year": "2022",
        "Tax year beginning": "01-01-2022",
        "Tax year ending": "12-31-2022"
    },
    "Name of foundation": "THE RHODODS FOUNDATION",
    "Address": {
        "Number and street": "13-15 W 54th ST",
        "City or town": "NEW YORK",
        "State": "NY",
        "ZIP code": "10019"
    },
    "Employer identification number": "23-102392",
    "Part I - Analysis of Revenue and Expenses": {
        "Contributions, gifts, grants, etc., received": "",
        "Interest on savings and temporary cash investments": "",
        "Dividends and interest from securities": "280,358",
        "Gross rents": "",
        "Net rental income or (loss)": "",
        "Net gain or (loss) from sale of assets not on line 10": "-6,068",
        "Capital gain net income (from Part IV, line 2)": "3,219,668",
        "Net short-term capital gain": "",
        "Income modifications": "",
        "Total (add lines 1 through 9)": "3,494,040",
        "Expenses and Disbursements for Charitable Purposes (attach schedule)": {
            "Compensation of officers, directors, trustees, etc.": "",
            "Other employee salaries and wages": "",
            "Pension plans, employee benefits": "",
            "Legal fees (attach schedule)": "",
            "Accounting fees (attach schedule)": "13,000",
            "Other professional fees (attach schedule)": "6,500",
            "Interest": "",
            "Taxes (attach schedule)": "",
            "Depreciation (attach schedule) and depletion": "",
            "Occupancy": "",
            "Travel, conferences, and meetings": "",
            "Printing and publications": "",
            "Other expenses (attach schedule)": "53,134",
            "Total operating and administrative expenses": "157,584",
            "Contributions, gifts, grants paid": "555,082",
            "Total expenses and disbursements": "712,666",
            "Excess of revenue over expenses and disbursements": "2,781,374",
            "Net investment income": "2,485,978",
            "Adjusted net income": "231,296"
        }
    }
}

Applications and Benefits

The PDF-GPT4-JSON project opens up a wide range of possibilities for developers and data analysts. Here are some potential applications and benefits:

Data Extraction: The structured JSON output makes it easy to extract specific data elements from PDFs, such as tables, lists, or headings. This can be useful for data analysis, data mining, or integrating PDF data into other systems.
Automation: By automating the PDF-to-JSON conversion process, developers can save time and effort in manually extracting data from PDFs. This can be particularly beneficial for large volumes of PDF files or recurring data extraction tasks.
Integration: The JSON output can be easily integrated into existing workflows or applications. For example, it can be used as a data source for business intelligence dashboards, machine learning models, or data visualization tools.
Data Processing: The structured JSON format allows for easy manipulation and processing of PDF data. Developers can apply various data processing techniques, such as filtering, aggregation, or transformation, to derive insights or generate new data sets.

In conclusion, the PDF-GPT4-JSON project provides a powerful solution for converting PDF files into structured JSON format. By leveraging the capabilities of GPT-4 Vision, it simplifies the extraction and analysis of data from PDFs, opening up new possibilities for developers and data analysts. Whether it's automating data extraction, integrating PDF data into workflows, or performing advanced data processing, the PDF-GPT4-JSON project offers a versatile tool for working with PDFs.

Blog

Use GPT4-Vision for PDF to JSON data extraction

Maximo Guerrero

Converting PDFs to Structured JSON

The Challenge of OCR Data in PDFs

Generating Structured JSON with GPT-4 Vision

Installation and Usage

Applications and Benefits

Join Our Newsletter. No Spam, Only the good stuff.

Related