Build a serverless EU-Driving Licences OCR with Amazon Textract on AWS

TL;DR

In this article, we will develop a serverless micro-service equipped with OCR capabilities specifically tailored for EU-Driving Licences, enabling seamless integration into any digital product.

We will leverage an AWS Lambda function which will invoke Amazon Textract to scan documents uploaded to Amazon S3, subsequently storing the extracted data in Amazon DynamoDB.

Prerequisites

If you're new to serverless development, I recommend reading my article as a foundational resource. It provides a comprehensive overview of serverless architectures and introduces DevOps best practices. This will serve as an excellent starting point for grasping the concepts we'll be implementing in the upcoming steps.

What is Amazon Textract?

Amazon Textract is an ML-powered OCR (Optical Character Recognition) service designed to extract text and data from scanned documents such as PDFs and images, including handwritten text.

The best part is that you don't have to be an ML expert to use it: it's a managed OCR service that you interact with through its API.

Architecture

Let's delineate the fundamental behavior of our OCR system:

Upon file upload, the OCR process should be triggered.
The system should identify EU-Driving licences from the uploaded file.
Extracted information should be returned and stored in a database.

To architect this solution effectively, we can employ the following components:

Utilize an Amazon S3 bucket as the designated upload destination.
Implement an Amazon S3 Lambda Trigger to initiate an AWS Lambda function upon file upload.
Develop an AWS Lambda function responsible for invoking Amazon Textract.
Employ Amazon Textract as the core OCR tool for data extraction.
Employ Amazon DynamoDB as the target database to persist the retrieved information.

Invoker Function IaC

The key part of this function's Infrastructure as Code is the events section, which configures the triggers the function responds to.

We've set up the function to be activated upon the occurrence of an s3:ObjectCreated event, specifically targeting uploads to the "input/" prefix and files bearing the ".jpg" suffix. This configuration ensures that the function is selectively triggered only by image uploads.

ocr:
  handler: src/function/document/ocr/index.handler #function handler
  package: #package patterns
    include:
      - "!**/*"
      - src/function/document/ocr/**
  events: #events
    #keep warm event
    - schedule:
        rate: rate(5 minutes)
        enabled: ${strToBool(${self:custom.scheduleEnabled.${env:STAGE_NAME}})}
        input:
          warmer: true
    #S3 event
    - s3:
        bucket: ocr-documents
        event: s3:ObjectCreated:*
        rules:
          - prefix: input/
          - suffix: .jpg
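
The two rules act as a literal, case-sensitive filter on the object key, and the schedule event passes its input object as the whole event. The following is a minimal sketch of that behavior, not code from the original function: the helper names matchesTriggerRules and isWarmerEvent are hypothetical and only illustrate which invocations the handler should expect.

```javascript
// Hypothetical helpers, not part of the original function code.
const TRIGGER_PREFIX = 'input/';
const TRIGGER_SUFFIX = '.jpg';

// Mirrors the prefix/suffix rules of the S3 event: both must match literally.
function matchesTriggerRules(key) {
    return key.startsWith(TRIGGER_PREFIX) && key.endsWith(TRIGGER_SUFFIX);
}

// The schedule event sends { warmer: true }, so the handler can return early
// on keep-warm invocations instead of looking for S3 records.
function isWarmerEvent(event) {
    return Boolean(event && event.warmer === true);
}

console.log(matchesTriggerRules('input/licence.jpg'));  // true
console.log(matchesTriggerRules('input/licence.JPG'));  // false: the suffix is case sensitive
console.log(matchesTriggerRules('output/licence.jpg')); // false: wrong prefix
console.log(isWarmerEvent({ warmer: true }));           // true
```

One practical consequence is that uploads named .JPG or .jpeg would not start the function unless additional rules are configured for them.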

Invoker Function Code

Requirements

Here's a concise summary of the requirements for our OCR function:

Receive event information triggered by an upload to the "input" prefix in Amazon S3.
Utilize Amazon Textract to scan the uploaded image and extract the identity information from EU-Driving Licences.
Store the extracted information in an Amazon DynamoDB table.
Write a JSON representation of this information to Amazon S3 under the "output" prefix.
Remove the original input image from storage.
Return the extracted identity JSON in the response.

By adhering to these requirements, our OCR function will efficiently process uploaded images, extract relevant identity information, store it appropriately, and provide the necessary response.

As a good practice, let's describe these steps in a comment at the top of our function code.

/**
 * Main handler: receives an event from S3 and returns a response
 * This function reacts to an S3 trigger by performing these steps:
 * STEP 1. get bucket and image key from event (this function is triggered by S3 on upload to the input/ prefix)
 * STEP 2. pass image to Textract AnalyzeID API to get identity info
 * STEP 2.1 put identity info on DynamoDB
 * STEP 3. write a JSON object with identity info to S3 (under the output/ prefix)
 * STEP 4. delete the original input image
 * STEP 5. return the identity info recognized by Textract as response
 */
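
Steps 3 to 5 mostly come down to building small request objects. The sketch below shows one way they might look. The helper names, the output key convention (input/a.jpg becomes output/a.json) and the response shape are assumptions for illustration; only the Bucket, Key, Body and ContentType field names come from the S3 PutObject and DeleteObject APIs.

```javascript
// Hypothetical helpers sketching steps 3 to 5; names and response shape are assumptions.

// STEP 3: derive the output key ("input/a.jpg" -> "output/a.json") and the PutObject input.
function buildOutputPutInput(bucketName, inputKey, identity) {
    const baseName = inputKey.replace(/^input\//, '').replace(/\.jpg$/, '');
    return {
        Bucket: bucketName,
        Key: `output/${baseName}.json`,
        Body: JSON.stringify(identity),
        ContentType: 'application/json',
    };
}

// STEP 4: the DeleteObject input for the original image.
function buildInputDeleteInput(bucketName, inputKey) {
    return { Bucket: bucketName, Key: inputKey };
}

// STEP 5: the value returned by the handler (assumed shape).
function buildResponse(identity) {
    return { statusCode: 200, body: JSON.stringify(identity) };
}

const examplePut = buildOutputPutInput('ocr-documents', 'input/licence.jpg', { FIRST_NAME: { text: 'MARIA' } });
console.log(examplePut.Key); // output/licence.json
```

Keeping these as pure functions makes the handler easier to test, since the S3 client calls only receive the objects built here.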

Invoke Amazon Textract

The function extracts the bucket name and object key from the event triggered by S3, then passes them to Amazon Textract, which reads the file and extracts its data via the AnalyzeID API. This matches our objective of extracting identity data from EU-Driving Licences.

    /* STEP 1. Get bucket and image key from event */
    /* Get bucket name (it should be the bucket on which the trigger is active) */
    const bucketName = event['Records'][0]['s3']['bucket']['name'];
    /* Get key name (it should be a jpeg image); S3 URL-encodes keys in events, with spaces as "+" */
    const keyName = decodeURIComponent(event['Records'][0]['s3']['object']['key'].replace(/\+/g, ' '));
    /* Log bucket and key names */
    console.log(bucketName, keyName);

    /* STEP 2. Analyze an image with textract OCR */
    /* Prepare analyzeId command input passing s3 object info got from event*/
    const analyzeIDCommandInput = { // AnalyzeIDRequest
        DocumentPages: [ // DocumentPages // required
            { // Document
                S3Object: { // S3Object
                    Bucket: bucketName,
                    Name: keyName,
                },
            },
        ],
    };

    /* Execute analyzeId command with textract and get recognized info in a response */
    const analyzeIDCommand = new AnalyzeIDCommand(analyzeIDCommandInput);
    const analyzeIDCommandResponse = await textractClient.send(analyzeIDCommand);
    /* Log textract response */
    console.log(analyzeIDCommandResponse);

As shown, all we need to do is send an AnalyzeIDCommand to Amazon Textract, pointing at the file through the bucketName and keyName parameters. Textract then handles the OCR task for us, replacing what would otherwise be many lines of code with a single API call.
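
To make the logged response less abstract, here is an abridged, hand-written example of its shape, together with two small helpers for reading it. The values are made up and the helper names getFirstIdentityDocument and summarizeFields are hypothetical; the property names (IdentityDocuments, IdentityDocumentFields, Type, ValueDetection, Blocks) follow the AnalyzeID response structure.

```javascript
// Abridged, hand-written example of an AnalyzeID response; all values are made up.
const sampleResponse = {
    IdentityDocuments: [
        {
            DocumentIndex: 1,
            IdentityDocumentFields: [
                { Type: { Text: 'FIRST_NAME' }, ValueDetection: { Text: 'MARIA', Confidence: 99.1 } },
                { Type: { Text: 'LAST_NAME' }, ValueDetection: { Text: 'ROSSI', Confidence: 98.7 } },
                { Type: { Text: 'EXPIRATION_DATE' }, ValueDetection: { Text: '01/02/2030', Confidence: 71.4 } },
            ],
            Blocks: [],
        },
    ],
};

// Hypothetical helper: return the first identity document or fail loudly.
function getFirstIdentityDocument(response) {
    const documents = response.IdentityDocuments || [];
    if (documents.length === 0) {
        throw new Error('Textract returned no identity documents');
    }
    return documents[0];
}

// Hypothetical helper: summarize detected fields as "TYPE=value (confidence)".
function summarizeFields(identityDocument) {
    return (identityDocument.IdentityDocumentFields || []).map(
        (field) => `${field.Type.Text}=${field.ValueDetection.Text} (${field.ValueDetection.Confidence})`
    );
}

console.log(summarizeFields(getFirstIdentityDocument(sampleResponse)));
```

In this made-up sample the expiration date has a low confidence, which is the kind of case where a fallback on the raw text blocks becomes useful.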

Parse automatically detected fields

Amazon Textract excels at automatically recognizing identity-related information. Upon executing our AnalyzeIDCommand, we receive the "Identity Document Response" in JSON format. This response contains standardized identity fields, including those of particular interest to us, such as first name, last name, document number, and expiration date.

Parsing this JSON response is straightforward, enabling us to extract the identity fields effortlessly and store them within an identity object for further processing.

/**
 * Extract identity fields from the document fields portion of a document analyzed via Textract,
 * which is able to return automatically identified fields of an EU driving licence
 * In the Textract response, under the IdentityDocumentFields section, there are:
 * FIRST_NAME, which identifies the name
 * LAST_NAME, which identifies the surname
 * DOCUMENT_NUMBER, which identifies the licence number
 * EXPIRATION_DATE, which identifies the expiration date
 * Each field has a confidence, which we require to be at least 95%
 * @param identityDocument
 * @param identity
 */
function extractFromIdentityDocumentFields(identityDocument, identity) {
    /* Cycle fields */
    for (let j = 0; j < identityDocument.IdentityDocumentFields.length; j++) {
        /* Get field */
        const identityDocumentField = identityDocument.IdentityDocumentFields[j];
        /* If the field is one of the four we need, its confidence is at least 95 and its text is not empty */
        if (
            (
                identityDocumentField.Type.Text === 'FIRST_NAME' //if type FIRST_NAME
                || identityDocumentField.Type.Text === 'LAST_NAME' //if type LAST_NAME
                || identityDocumentField.Type.Text === 'DOCUMENT_NUMBER' //if type DOCUMENT_NUMBER
                || identityDocumentField.Type.Text === 'EXPIRATION_DATE' //if type EXPIRATION_DATE
            )
            && identityDocumentField.ValueDetection.Confidence >= 95 // if confidence is at least 95%
            && identityDocumentField.ValueDetection.Text !== '' // if text is not empty
        ) {
            /* Set name, surname or document number in identity */
            identity[identityDocumentField.Type.Text]['text'] = identityDocumentField.ValueDetection.Text;
            //set as document-field to say we recognized it via document fields parsing
            identity[identityDocumentField.Type.Text]['type'] = 'document-field';
            identity[identityDocumentField.Type.Text]['confidence'] = identityDocumentField.ValueDetection.Confidence;
        }
        /* Exit if name,surname,expiration date and document number have been found */
        if (
            identity.FIRST_NAME['text']
            && identity.LAST_NAME['text']
            && identity.DOCUMENT_NUMBER['text']
            && identity.EXPIRATION_DATE['text']
        ) {
            break;
        }
    }
}
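
Note that the function assigns to identity[...]['text'] directly, so it relies on the identity object already holding an entry for each of the four field names; otherwise the assignment would throw a TypeError. The initialization is not shown in this article, so the following is only a sketch of what it might look like, with hypothetical helper names createEmptyIdentity and isIdentityComplete.

```javascript
// Hypothetical initial shape of the identity object; not shown in the original code.
const IDENTITY_FIELDS = ['FIRST_NAME', 'LAST_NAME', 'DOCUMENT_NUMBER', 'EXPIRATION_DATE'];

// Create one empty entry per field so later assignments never hit undefined.
function createEmptyIdentity() {
    const identity = {};
    for (const fieldName of IDENTITY_FIELDS) {
        identity[fieldName] = { text: '', type: '', confidence: 0 };
    }
    return identity;
}

// True only when every field has a non-empty text, like the exit condition of the loop.
function isIdentityComplete(identity) {
    return IDENTITY_FIELDS.every((fieldName) => Boolean(identity[fieldName].text));
}

const exampleIdentity = createEmptyIdentity();
console.log(isIdentityComplete(exampleIdentity)); // false
```

A helper like isIdentityComplete could also replace the repeated four-way check used to exit the loops early.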

Fine tuning with text detection

Furthermore, within the response of the analyzeIdCommand, we encounter the "Text detection and document analysis response". This JSON encapsulates all content identified by Amazon Textract through conventional OCR methods.

This resource serves as a valuable fallback option. It becomes particularly useful in scenarios where Amazon Textract might not automatically recognize the required information or when the identity details are identified with a confidence level below our specified threshold, set at 95% in this instance.

/**
 * Extract identity fields from the blocks portion of a document analyzed via Textract
 * This is a fallback if document fields have not been identified by Textract
 * In this case Textract returns Blocks, an array of blocks,
 * each of which identifies a page, a line or a word
 * As EU driving licences have a strict format in which:
 * Statement "1." identifies surname (last name)
 * Statement "2." identifies name (first name)
 * Statement "4b." identifies expiration date
 * Statement "5." identifies licence number (document number)
 * this function searches for those patterns to identify information in the block in
 * which those patterns are present (or the subsequent one)
 * @param identityDocument
 * @param identity
 */
function extractFromIdentityDocumentBlocks(identityDocument,identity) {
    /* If any of name, surname, document number or expiration date is empty */
    if (
        !identity.FIRST_NAME['text']
        || !identity.LAST_NAME['text']
        || !identity.DOCUMENT_NUMBER['text']
        || !identity.EXPIRATION_DATE['text']
    ) {
        /* Cycle blocks*/
        for (let j = 0; j < identityDocument.Blocks.length; j++) {
            /* Get the block */
            const block = identityDocument.Blocks[j];

            /* Check for "1. " as the last name label in an EU driving licence */
            /* If present and last name has not been set in identity, the last name should follow on the same line */
            parseBlock(block,'LINE',j,'this','1. ',identity,'LAST_NAME',identityDocument);
            /* Check for "1." as the last name label in an EU driving licence */
            /* If present and last name has not been set, the last name should be the text of the next block */
            parseBlock(block,'LINE',j+1,'next','1.',identity,'LAST_NAME',identityDocument);

            /* Check for "2. " as the first name label in an EU driving licence */
            /* If present and first name has not been set in identity, the first name should follow on the same line */
            parseBlock(block,'LINE',j,'this','2. ',identity,'FIRST_NAME',identityDocument);
            /* Check for "2." as the first name label in an EU driving licence */
            /* If present and first name has not been set, the first name should be the text of the next block */
            parseBlock(block,'LINE',j+1,'next','2.',identity,'FIRST_NAME',identityDocument);

            /* Check for "4b. " as the expiration date label in an EU driving licence */
            /* If present and expiration date has not been set in identity, the date should follow on the same line */
            parseBlock(block,'LINE',j,'this','4b. ',identity,'EXPIRATION_DATE',identityDocument);
            /* Check for "4b." as the expiration date label in an EU driving licence */
            /* If present and expiration date has not been set, the date should be the text of the next block */
            parseBlock(block,'LINE',j+1,'next','4b.',identity,'EXPIRATION_DATE',identityDocument);

            /* Check for "5. " as the document number label in an EU driving licence */
            /* If present and document number has not been set in identity, the number should follow on the same line */
            parseBlock(block,'LINE',j,'this','5. ',identity,'DOCUMENT_NUMBER',identityDocument);
            /* Check for "5." as the document number label in an EU driving licence */
            /* If present and document number has not been set, the number should be the text of the next block */
            parseBlock(block,'LINE',j+1,'next','5.',identity,'DOCUMENT_NUMBER',identityDocument);
            // Exit if name,surname,expiration date and document number have been found
            if (
                identity.FIRST_NAME['text']
                && identity.LAST_NAME['text']
                && identity.DOCUMENT_NUMBER['text']
                && identity.EXPIRATION_DATE['text']
            ) {
                break;
            }
        }
    }
}

As evident, parsing the Blocks is straightforward, especially considering our specific case where we are familiar with the standard "layout" of our document.
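
The parseBlock helper itself is not shown in this article. Purely as an illustration, and inferring its behavior from the call sites and comments above, it might look something like the sketch below. Everything in it is an assumption: the parameter names, the 'block' type label, and the rule that a label either shares a line with its value ('this') or stands alone before it ('next').

```javascript
// Hypothetical sketch of parseBlock; the original implementation is not shown.
function parseBlock(block, blockType, index, mode, pattern, identity, fieldName, identityDocument) {
    // Skip if the field is already set, or the block has the wrong type or no text
    if (identity[fieldName].text || block.BlockType !== blockType || !block.Text) {
        return;
    }
    let value = '';
    let source = block;
    if (mode === 'this' && block.Text.startsWith(pattern)) {
        // The value shares the line with its label: "1. ROSSI" -> "ROSSI"
        value = block.Text.slice(pattern.length).trim();
    } else if (mode === 'next' && block.Text.trim() === pattern) {
        // The label stands alone: take the text of the block at `index` (the following one)
        const target = identityDocument.Blocks[index];
        if (target && target.BlockType === blockType && target.Text) {
            value = target.Text.trim();
            source = target;
        }
    }
    if (value) {
        // 'block' marks values recovered by the fallback rather than by document fields
        identity[fieldName] = { text: value, type: 'block', confidence: source.Confidence };
    }
}

// Made-up blocks: one label with its value on the same line, one label on its own line.
const sampleDocument = {
    Blocks: [
        { BlockType: 'LINE', Text: '1. ROSSI', Confidence: 99 },
        { BlockType: 'LINE', Text: '2.', Confidence: 98 },
        { BlockType: 'LINE', Text: 'MARIA', Confidence: 97 },
    ],
};
const sampleIdentity = { LAST_NAME: { text: '' }, FIRST_NAME: { text: '' } };
sampleDocument.Blocks.forEach((block, j) => {
    parseBlock(block, 'LINE', j, 'this', '1. ', sampleIdentity, 'LAST_NAME', sampleDocument);
    parseBlock(block, 'LINE', j + 1, 'next', '2.', sampleIdentity, 'FIRST_NAME', sampleDocument);
});
console.log(sampleIdentity.LAST_NAME.text, sampleIdentity.FIRST_NAME.text); // ROSSI MARIA
```

This sketch applies no confidence threshold to the fallback values; whether the real helper does is not stated here, so it may be worth adding one before trusting block-level results.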

Takeaways

I've omitted the full code of our solution here as it's both simple and thoroughly documented. What's crucial to emphasize is that Amazon Textract offers optimized functionality for scanning identity documents. Additionally, it serves as a standard OCR tool, providing the flexibility to refine results when you're familiar with the specific "layout" of your input document. This versatility ensures adaptability to a wide range of use cases.

Policy

Our Lambdas require policy statements to be attached to the role used at runtime. For convenience, we'll add them under the provider section. However, it's also possible to add them specifically for each Lambda function if needed. This flexibility allows for tailored access control and permissions management, ensuring that each function has precisely the permissions it requires without unnecessary access.

iam:
  role:
    statements:
      # Allow functions to use s3
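      - Effect: Allow
        Action:
          - 's3:ListBucket'
          - 's3:GetObject' # Textract reads the image with the function's credentials
          - 's3:PutObject'
          - 's3:DeleteObject'
        Resource:
          - 'arn:aws:s3:::ocr-documents' # bucket ARN, needed by s3:ListBucket
          - 'arn:aws:s3:::ocr-documents/*'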
      # Allow functions to use textract
      - Effect: Allow
        Action: 'textract:*'
        Resource: '*'
      # Allow functions to use dynamodb
      - Effect: Allow
        Action: 'dynamodb:PartiQLInsert'
        Resource: '*'
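
The dynamodb:PartiQLInsert action suggests that the function writes to DynamoDB with a PartiQL INSERT statement, although that code is not shown in this article. As a rough sketch under that assumption, the input for an ExecuteStatement call might be built as follows. The helper name, the table name and the attribute names are hypothetical; only the Statement and Parameters properties and the { S: ... } attribute value format come from the DynamoDB API.

```javascript
// Hypothetical table and attribute names; adapt them to your own schema.
function buildInsertStatementInput(tableName, identity) {
    return {
        // Placeholders (?) are filled, in order, by the Parameters array
        Statement: `INSERT INTO "${tableName}" VALUE {'documentNumber': ?, 'firstName': ?, 'lastName': ?, 'expirationDate': ?}`,
        Parameters: [
            { S: identity.DOCUMENT_NUMBER.text },
            { S: identity.FIRST_NAME.text },
            { S: identity.LAST_NAME.text },
            { S: identity.EXPIRATION_DATE.text },
        ],
    };
}

const exampleInsert = buildInsertStatementInput('identities', {
    DOCUMENT_NUMBER: { text: 'AB1234567' },
    FIRST_NAME: { text: 'MARIA' },
    LAST_NAME: { text: 'ROSSI' },
    EXPIRATION_DATE: { text: '01/02/2030' },
});
console.log(exampleInsert.Statement);
```

Passing values as parameters rather than concatenating them into the statement avoids quoting problems with names that contain apostrophes. In a real deployment the Resource for this statement could also be narrowed from '*' to the ARN of the target table.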

🏁 Final Thoughts

We've explored the process of constructing an OCR solution with Amazon Textract, seamlessly integrating it with other essential services like Amazon EventBridge, Amazon S3, and Amazon DynamoDB.

This framework serves as an excellent starting point for any OCR use case. It presents a loosely coupled micro-service that seamlessly integrates into your architecture, requiring nothing more than an uploaded file as input. Such modular design ensures flexibility and scalability, allowing easy adaptation to diverse requirements and environments.

🌐 Resources

You can find a skeleton of this architecture open sourced by Eleva here.

🏆 Credits

A heartfelt thank you to my colleagues:

L. Viada, G. Blanc and the folks at RevoDigital, who are using this micro-service in several production projects.
A. Fraccarollo and, again, A. Pagani, as the co-authors of the CF files and watchful eyes on the networking and security aspects.
C. Belloli and L. Formenti for pushing me to come out of my nerd cave.
L. De Filippi for enabling us to make this repo open source and to explain how we developed this micro-service.

We all believe in sharing as a tool to improve our work; therefore, every PR will be welcomed.

🙋 Who am I

I'm D. De Sio and I work as a Solution Architect and Dev Tech Lead at Eleva.
I'm currently (Apr 2024) an AWS Certified Solutions Architect - Professional, as well as a User Group Leader (in Pavia) and, last but not least, a #serverless enthusiast.
My work in this field is to advocate for serverless and to help as many dev teams as possible adopt it, as well as to help customers break their monoliths into APIs and micro-services using it.
