Unlocking Rapid Data Extraction: Groq + OCR and Claude Vision

Introduction

In this article, we explore various methods for extracting data from documents, comparing OCR+LLM with Claude 3 Vision, and delving into fast OCR transformers and cloud-native OCRs. We also provide a code example for implementing OCR as a simple API using docTR and discuss how Groq can be leveraged to achieve the best inference speed for LLMs.

The Use Case: Document Scanning to Save Time

Imagine a SaaS platform that helps register invoices for a company. Speed and convenience are paramount, and while some errors are tolerable, the goal is to minimize them. This scenario highlights the need for rapid and reliable data extraction.

Claude 3 Vision vs OCR+LLM

Claude 3 Vision

Claude 3 Vision reads the document image directly and is known for its speed and cost-efficiency. Its main limitation is a tendency to hallucinate, that is, to return values that do not appear in the document. It is suitable for simple tasks but may fall short in more complex scenarios.

OCR+LLM

OCR+LLM combines Optical Character Recognition (OCR) with Large Language Models (LLMs) to extract and analyze text. This approach offers a balance between accuracy and speed, making it ideal for more detailed data extraction tasks.
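To make the hand-off between the two stages concrete, here is a minimal sketch of how OCR output might be packed into a single prompt for the LLM. The function name `build_extraction_prompt`, the section markers, and the page separator are illustrative assumptions, not a fixed format.

```python
def build_extraction_prompt(page_texts, contract, instructions):
    """Join OCR page texts and the output contract into one LLM prompt.

    All names and section markers here are illustrative assumptions.
    """
    # Mark page boundaries explicitly so the model can tell pages apart.
    content = "\n--- new page ---\n".join(page_texts)
    return (
        f"{instructions}\n"
        f"#### Content\n\n{content}\n\n"
        f"#### Structure of the JSON output\n\n{contract}\n\n"
        "#### JSON response\n"
    )


if __name__ == "__main__":
    prompt = build_extraction_prompt(
        ["Invoice INV-001", "Total: 120.00"],
        'invoice_number: "string"\ntotal: number',
        "Extract the fields below and answer with JSON only.",
    )
    print(prompt)
```

Keeping the contract in the prompt rather than in code means the same endpoint can serve different document types by swapping the contract text.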

Testing Limits with Claude 3 Vision

Using an example invoice, we can define a protocol for our application:

invoice_number: "string"
invoice_date: "string"  # YYYY-MM-DD
due_date: "string"  # YYYY-MM-DD
seller_details:
  seller_name: "string"
  seller_address:
    street_number_and_name: "string"
    city_or_town: "string"
    country: "string"
buyer_details:
  buyer_name: "string"
  buyer_address:
    street_number_and_name: "string"
    city_or_town: "string"
    country: "string"
  buyer_email: "string"
  buyer_phone_number: "string"
products_services:
  - item_number: number
    description: "string"
    quantity: number
    unit_price: number
    total_price: number
sub_total: number
total: number

This pseudo-YAML format outlines the fields we want to extract from an invoice. Testing with Claude 3 Vision yielded response times of about 1 second, which is slower than desired.
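Because some extraction errors are expected, it can help to check each result against the protocol before storing it. The sketch below is one possible validator, written with the standard library only; the function name `validate_invoice`, the rounding tolerance, and the specific checks are assumptions for illustration and cover just a few of the fields above.

```python
from datetime import datetime

REQUIRED_FIELDS = (
    "invoice_number", "invoice_date", "due_date", "seller_details",
    "buyer_details", "products_services", "sub_total", "total",
)
ITEM_NUMBER_FIELDS = ("quantity", "unit_price", "total_price")


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_invoice(data: dict, tolerance: float = 0.01) -> list:
    """Return a list of problems found in an extracted invoice (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]

    # Dates are expected in the YYYY-MM-DD form noted in the protocol.
    for name in ("invoice_date", "due_date"):
        value = data.get(name)
        if value is None:
            continue
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append(f"{name} is not a YYYY-MM-DD date: {value!r}")

    # Each line total should equal quantity * unit_price, and the line totals
    # should add up to sub_total, within a small rounding tolerance.
    items = data.get("products_services")
    if not isinstance(items, list):
        items = []
    line_sum = 0.0
    for index, item in enumerate(items):
        if not isinstance(item, dict) or not all(
            _is_number(item.get(field)) for field in ITEM_NUMBER_FIELDS
        ):
            problems.append(f"item {index}: missing or non-numeric amounts")
            continue
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["total_price"]) > tolerance:
            problems.append(f"item {index}: total_price does not match quantity * unit_price")
        line_sum += item["total_price"]

    sub_total = data.get("sub_total")
    if _is_number(sub_total) and abs(line_sum - sub_total) > tolerance:
        problems.append(f"sub_total {sub_total} does not match sum of line totals {line_sum}")
    return problems
```

Arithmetic checks like these are a cheap way to catch hallucinated or misread amounts, since an invented number rarely stays consistent with the rest of the invoice.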

OCR Transformers Designed for Speed

Notable OCR Tools

docTR: Optimized for speed on both CPU and GPU; a basic pipeline takes only about three lines of code.
TrOCR: Transformer-based OCR models pre-trained and published by Microsoft, available in several sizes.
PaddleOCR: Known for its speed; it can process large volumes of images in real time.
MMOCR: OpenMMLab's OCR toolbox, another fast option.
Surya: Highly efficient and fast.

Performance testing showed that these OCR tools could achieve processing times as low as 20ms on a GPU.
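Latency figures like this depend heavily on hardware, image size, and whether the model is already loaded, so it is worth measuring in your own environment. The helper below is a small sketch of how such a measurement might be taken; `measure_latency_ms` and its parameters are hypothetical names, and it works with any callable.

```python
import statistics
import time


def measure_latency_ms(fn, *args, warmup: int = 2, runs: int = 10) -> dict:
    """Time fn(*args) and report median and maximum latency in milliseconds."""
    # Warm-up calls absorb one-off costs such as model loading, which would
    # otherwise distort the first measurements.
    for _ in range(warmup):
        fn(*args)

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "runs": runs,
    }


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a call to your OCR model.
    print(measure_latency_ms(lambda: sum(range(100_000))))
```

Reporting the median alongside the maximum makes it easier to see whether a low number is typical or a best case.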

Cloud-Native OCRs

Azure Form Recognizer (now Azure AI Document Intelligence): Best response time of around 3 seconds.
Amazon Textract: Processes documents in 3-4 seconds per page.
Google Cloud Vision API and Document AI: Highly efficient, with response times similar to Azure and Amazon.
ABBYY Cloud OCR: Faster than the other alternatives and offers detailed page representations.

These cloud AI services used to be the go-to solutions but are now often replaced by LLMs due to cost and flexibility advantages.


Implementing OCR with docTR

Setting Up the OCR API

Here’s a simple example using docTR:

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

app = FastAPI(title="OCR Service using docTR")

# Load the pretrained model once at startup. Building it inside the request
# handler would reload the weights on every call and dominate the latency.
model = ocr_predictor(pretrained=True)

@app.post("/ocr/")
async def perform_ocr(file: UploadFile = File(...)):
    # Read the uploaded image bytes and wrap them as a docTR document
    image_data = await file.read()
    doc = DocumentFile.from_images(image_data)
    result = model(doc)

    # Walk pages -> blocks -> lines and rebuild each line from its words
    extracted_texts = []
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                line_text = ' '.join(word.value for word in line.words)
                extracted_texts.append(line_text)

    return JSONResponse(content={"ExtractedText": extracted_texts})

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)
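Since the goal is to minimize errors, the per-word confidence scores that docTR produces can be used to flag lines worth a second look. The sketch below assumes that `result.export()` returns a nested dictionary of pages, blocks, lines, and words, with each word carrying `value` and `confidence` keys; check this against the docTR version you use. The function name and the 0.6 threshold are arbitrary choices for illustration.

```python
def low_confidence_lines(export: dict, threshold: float = 0.6) -> list:
    """List text lines whose weakest word confidence is below threshold.

    Assumes the nested layout pages -> blocks -> lines -> words, where each
    word is a dict with "value" and "confidence" keys.
    """
    flagged = []
    for page_index, page in enumerate(export.get("pages", [])):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                words = line.get("words", [])
                if not words:
                    continue
                # A line is only as reliable as its least certain word.
                weakest = min(word["confidence"] for word in words)
                if weakest < threshold:
                    flagged.append({
                        "page": page_index,
                        "text": " ".join(word["value"] for word in words),
                        "min_confidence": weakest,
                    })
    return flagged
```

In the handler above, this could be called as `low_confidence_lines(result.export())` and the flagged lines returned alongside the extracted text, so that a user can review only the doubtful parts.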

Docker Setup for the OCR API

# Use the official TensorFlow image as the parent image
FROM tensorflow/tensorflow:latest

# Set the working directory in the container
WORKDIR /app

# Install system dependencies required for OpenCV and WeasyPrint
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libpango-1.0-0 \
    libpangocairo-1.0-0 \
    libgdk-pixbuf2.0-0 \
    libffi-dev \
    shared-mime-info

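# Install FastAPI, Uvicorn, and the upload-handling dependencies
RUN pip install fastapi uvicorn python-multipart aiofiles Pillow

# Install `doctr` with TensorFlow support (quoted so the shell does not treat the brackets as a glob)
RUN pip install "python-doctr[tf]"

# Copy the local directory contents into the container last, so that code
# changes do not invalidate the cached dependency layers
COPY . /app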

# Expose the port FastAPI will run on
EXPOSE 8001

# Command to run the FastAPI server on container start
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8001", "--workers", "4"]


Groq: The King of Speed

Groq's inference architecture is built for low-latency LLM serving and offers strong performance and cost efficiency: roughly three times the speed at half the cost of traditional inference setups.

Using Groq with OCR

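import os

from groq import Groq

# Read the API key from the environment rather than hard-coding it
API_KEY_GROQ = os.environ["GROQ_API_KEY"]

def send_request_to_groq(content: str) -> str:
    client = Groq(api_key=API_KEY_GROQ)
    completion = client.chat.completions.create(
        model="gemma-7b-it",
        messages=[
            {
                "role": "system",
                "content": "You are an API server that receives content from a document and returns a JSON with the defined protocol"
            },
            {
                "role": "user",
                "content": content
            }
        ],
        # A low temperature keeps extraction output as repeatable as possible
        temperature=0,
        max_tokens=1024,
        top_p=1,
        stream=False,
        # JSON mode: the prompt must mention JSON, as the system message does
        response_format={"type": "json_object"},
        stop=None,
    )

    return completion.choices[0].message.content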

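The response_format option is particularly useful: setting it to {"type": "json_object"} enables JSON mode, which constrains the model's reply to valid JSON that can be parsed directly. The OpenAI API exposes the same parameter, so the option is not unique to Groq.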
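Even when the provider is asked for JSON, it is prudent to treat the reply as untrusted until it parses. The wrapper below is a minimal sketch of a parse-and-retry loop; `request_json` is a hypothetical helper, and `send` stands for any function that takes a prompt and returns the model's text, such as `send_request_to_groq`.

```python
import json


def request_json(send, prompt: str, attempts: int = 2) -> dict:
    """Call send(prompt) until the reply parses as a JSON object."""
    last_error = None
    for _ in range(attempts):
        raw = send(prompt)
        try:
            parsed = json.loads(raw)
        except (TypeError, json.JSONDecodeError) as error:
            # Not valid JSON (or not a string at all): try again.
            last_error = error
            continue
        if isinstance(parsed, dict):
            return parsed
        # Valid JSON, but an array or scalar rather than an object.
        last_error = ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    raise ValueError(f"no valid JSON object after {attempts} attempts") from last_error
```

A call such as `request_json(send_request_to_groq, prompt)` would then either return a dictionary or raise, which keeps malformed replies from reaching the rest of the application.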

Final Implementation

Controller Code

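import json
import os
import shutil
import tempfile
import time

from fastapi import File, Form, UploadFile

# convert_pdf_to_images, extract_text_with_pytesseract, remove_json_format,
# systemMessage and jsonContentStarter are defined elsewhere in the application.

@app.post("/extract_fast")
async def extract_text(file: UploadFile = File(...), extraction_contract: str = Form(...)):
    # Write the upload to disk and close the handle so the data is flushed
    # before the PDF converter opens the file.
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        shutil.copyfileobj(file.file, temp_file)
        file_path = temp_file.name

    try:
        images = convert_pdf_to_images(file_path)
        page_texts = extract_text_with_pytesseract(images)
    finally:
        # delete=False leaves the file behind, so remove it explicitly
        os.remove(file_path)

    # Assemble the prompt: instructions, OCR content, output contract, JSON starter
    extracted_text = "\n new page --- \n".join(page_texts)
    extracted_text = systemMessage + "\n####Content\n\n" + extracted_text
    extracted_text = extracted_text + "\n####Structure of the JSON output file\n\n" + extraction_contract
    extracted_text = extracted_text + "\n#### JSON Response\n\n" + jsonContentStarter

    start_time = time.perf_counter()
    content = send_request_to_groq(extracted_text)
    elapsed_time = time.perf_counter() - start_time
    print(f"send_request_to_groq took {elapsed_time:.3f} seconds")

    content = remove_json_format(content)

    return json.loads(content)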
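The controller relies on a `remove_json_format` helper that is not shown. As a guess at what such a helper might do, the sketch below strips anything wrapped around the JSON object, such as a Markdown code fence or a sentence of commentary, by keeping only the text between the first opening brace and the last closing brace. The name `strip_json_wrapping` is hypothetical, and the approach assumes the surrounding text contains no braces of its own.

```python
import json


def strip_json_wrapping(text: str) -> str:
    """Keep only the span from the first '{' to the last '}' in text.

    This discards Markdown fences and commentary around a JSON object.
    It assumes the wrapping text contains no braces itself.
    """
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start:end + 1]
    return text.strip()


if __name__ == "__main__":
    reply = 'Here is the result: {"invoice_number": "INV-001", "total": 120.0} Done.'
    print(json.loads(strip_json_wrapping(reply)))
```

If no braces are found, the text is returned unchanged (apart from trimmed whitespace), so a later `json.loads` call still fails loudly instead of hiding the problem.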

Conclusion

For document scanning and data extraction, combining GPU-accelerated OCR with LLM inference on Groq provides superior speed and efficiency. This approach is especially beneficial for processing invoices and other documents captured via mobile devices.
