Scalable Processing of Swiss PDF Documents using 2D Barcodes on AWS
Arlind Nocaj
Posted on November 5, 2021
Abstract
Processing scanned documents such as tax and salary statements requires a lot of effort when performed manually by humans. In this post we look in particular at Swiss documents such as tax statements, salary statements, and invoices, and show how to automate their processing in a scalable way using open-source and AWS technologies.
Our extraction approach utilizes 2D barcodes and can easily be extended to other documents or images that contain barcodes.
Introduction
In many industries, such as finance, insurance, accounting, and tax, documents are still a primary medium of record keeping, communication, and interaction with other parties. Processing scanned documents such as tax and salary statements requires a lot of effort and is error-prone when performed manually by humans.
We show how to build a simple and scalable solution that can process Swiss documents, such as Zurich tax statements, salary statements, and QR invoices, using 2D barcodes. In particular, we extend the Document Understanding Solution (DUS) from AWS Labs to support the processing of these Swiss document types by using 2D barcodes. Figure 1 shows two supported example documents, a Swiss salary statement and a Zurich tax statement.
In the next section we show how to extend an existing sample solution for our specific case of processing documents with 2D barcodes. After reviewing the barcode extraction, we present the user interface in the demo section. We then discuss the scalability of the system and review the supported documents. Finally, we conclude in the last section.
Understanding Documents with Barcodes
The Document Understanding Solution (DUS) is an example application from AWS Labs that illustrates how to utilize multiple AI/machine learning capabilities to solve various document processing and understanding cases. It shows how to combine and leverage the power of AWS services such as Amazon Textract and Amazon Comprehend to extract information from structured and unstructured scanned PDF documents or images.
DUS works as follows:
- A document (PDF or image) is uploaded to Amazon S3, which triggers an event, and the file is added to a DynamoDB table for state management.
- The change to the DynamoDB table triggers an event, which is processed by an AWS Lambda function that adds the file to the appropriate queue: the sync queue for images or the async queue for PDFs (a simplified routing sketch is shown after this list).
- The messages in each queue are processed by a Lambda function that calls Amazon Textract and other services for information extraction and storage.
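To illustrate the routing step, the following is a minimal, hypothetical sketch of such a routing function; the queue URLs, DynamoDB attribute names, and file-type check are assumptions for illustration and do not mirror the exact DUS implementation.

# Hypothetical routing sketch: new documents registered in DynamoDB are forwarded
# to the sync queue (images) or the async queue (PDFs). Attribute names and
# environment variables are assumptions for illustration.
import json
import os

import boto3

sqs = boto3.client("sqs")
SYNC_QUEUE_URL = os.environ["SYNC_QUEUE_URL"]
ASYNC_QUEUE_URL = os.environ["ASYNC_QUEUE_URL"]

def handler(event, context):
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        document_id = new_image["documentId"]["S"]
        object_name = new_image["objectName"]["S"]

        # PDFs go to the async queue, images to the sync queue
        queue_url = ASYNC_QUEUE_URL if object_name.lower().endswith(".pdf") else SYNC_QUEUE_URL
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"documentId": document_id, "objectName": object_name}),
        )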
The architecture, see Figure 2, and its resources are defined with the AWS Cloud Development Kit (AWS CDK). AWS CDK is an open-source software development framework that lets you define application resources using familiar programming languages.
To extend the solution with an additional pipeline for barcode processing, we have to do the following:
- Add a Lambda service for barcode processing that can read a PDF document from S3 and write the extracted barcode data to a specific S3 bucket.
- Connect the barcode Lambda service to the existing pipeline by extending the AWS CDK components in DUS.
- Add an additional tab to the frontend to preview the barcode extraction results.
The barcode processing extension from Figure 2 (steps 1 to 4) can be mapped directly to human-readable infrastructure as code using the CDK:
1) Create an Amazon Simple Queue Service (SQS) queue for the barcode processor
const syncBarcodeJobsQueue = new sqs.Queue(this,
this.resourceName("SyncBarcodeJobs"), {
visibilityTimeout: cdk.Duration.seconds(900),
retentionPeriod: cdk.Duration.seconds(1209600),
encryption: QueueEncryption.KMS_MANAGED
});
2) Add the queue as an event source to the AWS Lambda function for barcode extraction
syncBarcodeProcessor.addEventSource(
new SqsEventSource(syncBarcodeJobsQueue, {
batchSize: 1
})
);
3) Create the AWS Lambda function for the sync barcode processor using a Dockerfile. That way we can also run the image locally and debug the code, if required, using an IDE like PyCharm.
// Configure path to Dockerfile
const dockerfile = path.join(__dirname, "../lambda/barcodeprocessor");
// Create AWS Lambda function and push image to ECR
const syncBarcodeProcessor = new lambda.DockerImageFunction(this,
this.resourceName("SyncBarcodeProcessor"), {
code: lambda.DockerImageCode.fromImageAsset(dockerfile),
memorySize: 5024,
timeout: cdk.Duration.minutes(2),
tracing: lambda.Tracing.ACTIVE,
environment: {
OUTPUT_BUCKET: documentsS3Bucket.bucketName,
OUTPUT_TABLE: outputTable.tableName,
DOCUMENTS_TABLE: documentsTable.tableName
}
});
4) Grant read and write permissions to the Lambda function so that it can read the documents and store the results
// Permissions for barcode processor
documentsS3Bucket.grantReadWrite(syncBarcodeProcessor);
outputTable.grantReadWriteData(syncBarcodeProcessor);
documentsTable.grantReadWriteData(syncBarcodeProcessor);
Barcode Extraction
We use the docbarcodes package for the barcode extraction. The extraction consists of multiple steps, since a document can have several barcodes distributed over various regions of a page.
The main steps of the barcode extraction are:
- Detect candidate barcode regions: an image transformation heuristic based on morphological operations is used to identify rectangular dark regions that might contain a barcode.
- Extract the raw barcodes from the candidate regions: the extraction is performed with zxing, an open-source library that supports many variations of 1D and 2D barcodes.
- Combine multiple raw barcodes and decode the data: depending on the document, the data might not fit into a single barcode and is therefore typically split across multiple barcodes. These chunks need to be combined in the right order, and a decompression (e.g. zlib) needs to be applied to recover the original information (a minimal sketch of this step follows the list). An example of a final decoded output, e.g. for the Swiss salary statement, can be found below.
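The combination step can be illustrated with a minimal sketch. The combine_chunks helper below is hypothetical, not the docbarcodes implementation; it assumes the chunks are already ordered and that the payload is zlib-compressed, which can vary per document type.

# Hypothetical sketch of the combination step (not the docbarcodes code).
# Assumes the raw chunks are already in the right order and that the payload
# is zlib-compressed; real documents may use other container or compression formats.
import zlib

def combine_chunks(raw_chunks):
    # Raw barcode payloads are ISO-8859-1 strings: re-encode them to bytes,
    # concatenate the ordered chunks, decompress, and decode the resulting XML.
    payload = b"".join(chunk.encode("iso-8859-1") for chunk in raw_chunks)
    return zlib.decompress(payload).decode("utf-8")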
The following code shows the main part of the AWS Lambda barcode processing function in Python. The full code will be available soon in the DUS dev branch; you can preview it in the forked branch DUS dev barcodes.
# AWS Lambda function to extract barcodes from PDF documents or images
import json
import os
import tempfile
from pathlib import Path

import boto3
from docbarcodes.extract import process_document

# S3Helper and datastore are helper modules shipped with DUS; the *_S3_PREFIX
# constants are defined in the surrounding module
s3 = boto3.client('s3')

def processPDF(documentId, features, bucketName, outputBucketName,
               objectName, outputTableName, documentsTableName):
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, objectName)
        os.makedirs(Path(target).parent, exist_ok=True)
        s3.download_file(bucketName, objectName, target)

        # extract raw and decoded barcodes from the document
        barcodes_raw, barcodes_combined = process_document(target, max_pages=None, use_jpype=True)

        # create a proper JSON response
        raw_dict = [b._asdict() for b in barcodes_raw]
        combined_dict = [b._asdict() for b in barcodes_combined]
        response = {"BarcodesRaw": raw_dict, "BarcodesCombined": combined_dict}

        # store the result in the output bucket
        outputPath = '{}{}/{}{}barcodes.json'.format(PUBLIC_PATH_S3_PREFIX, documentId,
                                                     SERVICE_OUTPUT_PATH_S3_PREFIX,
                                                     BARCODES_PATH_S3_PREFIX)
        print("Generating output for DocumentId: {} and storing in {}".format(documentId, outputPath))
        S3Helper.writeToS3(json.dumps(response, ensure_ascii=False), outputBucketName, outputPath)

        # mark the document as processed in DynamoDB
        ds = datastore.DocumentStore(documentsTableName, outputTableName)
        ds.markDocumentComplete(documentId)
Barcode JSON Result
The generated JSON result is shown below. The raw field can contain binary data in ISO-8859-1 format; the decoded content is shown in BarcodesCombined. A short example of reading this result follows the listing.
{
"BarcodesRaw": [
{
"page": 0,
"num_candidate": 2,
"raw": "\u00b4c\u00fd\u00b8z\u0000\u0002V\u0001\u0003\u0000\u0001\u0000\u0001PK\u0003\u0004\u0014\u0000\b\b\b\u0000\u00e9v\u0003Q\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000txabM\u0092\u00ddn\u00e20\u0010\u0085_\u00c5\u00f2}b;\u00a1!\u00ac\u001cW\u00e1\u00afEbY\u00d4\u00d0j\u00d5;\u0013\u00dc\u00105\u00b1+\u00db,\u00f0>\u00fb&\u00fbb;\t\u00d0\u00f6\"\u00ce\u00f1\u0099\u0019\u00cf\u00e7\u0091\u00f9\u00fd\u00a9m\u00d0\u001fe]mt\u0086YH1R\u00ba4\u00bbZW\u0019~\u00de\u00cc\u0083\u0014\u00df\u000b\u00beA\u0090\u00a6]\u0086\u00f7\u00de\u007f\u00fc \u00e4x<\u0086\u00eeX;\u00b7SeX\u00ee\u0089+\u00f7\u00aa\u0095\u00c4\u00edHD#J\u00a3\u0088\u0092B6\u00d2\u009e\u00a7\u00aa\u0084\u009f\u00f4p\u00fa\u00e6\u0094\u008f1*\u0016\u00d3\fS\nm\u008a\u00b3{\u00e9$\u00c3\u0082OL\u00fb!\u00f5\u0019=/\u00a6\u00c1x^dx\u00f28\u000bX\u0014\u0087\u00d7\u000f\u00a35 v\u0084kyh$Z)\u00e7\u001be1z|\n\u009e&\u00c1J\u00b6\nj~\u00fd\\/g\u00bf\u00d1\u00acQ\u00ef\u00de\u001a]\u00bf\u00a3\u00fc\u0001\u00a3\u00d7\u00c5:\u00c3q\u00d2\u00b5\u009c,3\u009co\u00bd\u00aa\u009b\u0083\u00aeP\u00e1\u00d5AYPJ\u0003\u008d\u00b7J\u00f9\f\u00cf\u00eaJ\u00d9\u00a3\u00aaP\u0002=\u008d\u00f3[s\u0082\u00b1D\u0003(\u00ae\u00fd9\u00c3\u009b\u00fd\u0001\u00b2\u00d7{\u00a3\u00a1#\u008dc\u0014\u00c5)\u001a\u008c\u00d0\u0090a\"\u00f8\u0085r1EK\u00e9\u00bc\u00ee\u00a9r\u00b5\u00ad1\u009a\u00d7\u00f6\u00d3\u00d0Z\u00de\u00a8\u00e8\u008d\u00ea\u000b`\u00f9\u00ef\u00af\u00ae*\u00e9\u009c\u00f3\u0016V\u0085\u00a2\u00ef$\u00c9\b\n\u0096\u00a6\u0094MOs\u00a3\u001a+\u00ab\u00d1\bv\u00e6\u00a0\u00bd\u00ed\u0002\u0082\u0017/A^\u0004++\u00ba\u0019\u000e\u00ee\u0092a\u0098\u008e(\u000b\u00a3\u0094\u0093\u00cf\u0010'7b\u00c1s\u00c1\u00a7\u00a6\u0004\u00c58\u00b9\u0088\u00ee:\u00b5\u00d9\t\u00fefM+\"\u00ca\u0092\u0080\u00d1\u0080B\u00bc78\u00f4\u00aa\u009b\u00ab\u00cf\u0082\u0098rrq\u00faS\u00fb\u00c2\u0005<\u00a5V\u0089A\u00caX\u001a\u000e!\u00e1j\u00f0\u0007k\u009c\u00bbn\u0092\u0014\u00c6\u0010R\u0088~w\u00f9J\u00f9\u00ab\u00bcK\u00e2A\u00dc'|y\u009c\u0000/\u00d9\u0088\u00ffPK\u0007\bK\u00a6nJ\u00ce\u0001\u0000\u0000\u00c1\u0002\u0000\u0000PK\u0001\u0002\u0014\u0000\u0014\u0000\b\b\b\u0000\u00e9v\u0003QK\u00a6nJ\u00ce\u0001\u0000\u0000\u00c1\u0002\u0000\u0000\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000txabPK\u0005\u0006\u0000\u0000\u0000\u0000\u0001\u0000\u0001\u00002\u0000\u0000\u0000\u0000\u0002\u0000\u0000\u0000\u0000",
"format": "PDF_417",
"points": [...],
"resultMetadata": {
"ERROR_CORRECTION_LEVEL": "2",
"PDF417_EXTRA_METADATA": {...}
}
}
],
"BarcodesCombined": [
{
"content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?><T xmlns=\"http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB\" SID=\"000\" SysV=\"001\"><Company UID-BFS=\"CHE-123.123.123\" Person=\"Paula Nestler\" HR-RC-Name=\"COMPLEX Elektronik AG\" ZIP=\"3600\" CL=\"Abteilung Steuerungen\" Street=\"Eigerweg 6\" Postbox=\"124\" City=\"Thun\" Phone=\"033 238 49 71\"/><PersonID Lastname=\"Aebi\" Firstname=\"Anna\" ZIP=\"3000\" CL=\"\" Street=\"L\u00c3\u00a4nggassstrasse 26\" Postbox=\"690\" Locality=\"\" City=\"Bern 9\" Country=\"\"><SV-AS-Nr>123.4567.8901.28</SV-AS-Nr></PersonID><A><DocID>1</DocID><Period><from>2016-10-01</from><until>2016-11-30</until></Period><Income>48118.70</Income><GrossIncome>68000.00</GrossIncome><NetIncome>56343.00</NetIncome></A></T>",
"format": "PDF_417",
"sources": [0]
}
]
}
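A minimal sketch of consuming this result: since the raw field stores the binary payload as an ISO-8859-1 string, it can be re-encoded to bytes when the binary content needs to be inspected, while BarcodesCombined already contains the decoded text. The file name is an assumption for illustration.

import json

# Load a stored barcode result (file name is an assumption for illustration)
with open("barcodes.json", encoding="utf-8") as f:
    result = json.load(f)

# Recover the original binary payload of the first raw barcode
raw_bytes = result["BarcodesRaw"][0]["raw"].encode("iso-8859-1")

# The combined entry already contains the decoded XML as text
xml_content = result["BarcodesCombined"][0]["content"]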
Decoded Barcode Content
The decoded content from the salary statement contains structured XML with all the non-empty fields from the statement; a parsing sketch follows the listing.
<?xml version="1.0" encoding="UTF-8"?>
<T SID="000" SysV="001" xmlns="http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB">
<Company CL="Abteilung Steuerungen" City="Thun" HR-RC-Name="COMPLEX Elektronik AG" Person="Paula Nestler" Phone="033 238 49 71" Postbox="124" Street="Eigerweg 6" UID-BFS="CHE-123.123.123" ZIP="3600"/>
<PersonID CL="" City="Bern 9" Country="" Firstname="Anna" Lastname="Aebi" Locality="" Postbox="690" Street="Langgassstrasse 26" ZIP="3000">
<SV-AS-Nr>123.4567.8901.28</SV-AS-Nr>
</PersonID>
<A>
<DocID>1</DocID>
<Period>
<from>2016-10-01</from>
<until>2016-11-30</until>
</Period>
<Income>48118.70</Income>
<GrossIncome>68000.00</GrossIncome>
<NetIncome>56343.00</NetIncome>
</A>
</T>
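To consume these fields programmatically, the XML can be parsed with Python's standard library. The namespace below is taken from the statement above; the file name is again an assumption for illustration.

import json
import xml.etree.ElementTree as ET

# Namespace as declared in the decoded salary statement
NS = {"sd": "http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB"}

# Read the decoded XML string from BarcodesCombined (see previous example)
with open("barcodes.json", encoding="utf-8") as f:
    xml_content = json.load(f)["BarcodesCombined"][0]["content"]

root = ET.fromstring(xml_content)
person = root.find("sd:PersonID", NS)
print(person.get("Lastname"), person.get("Firstname"))    # Aebi Anna
print(root.find("sd:A/sd:GrossIncome", NS).text)          # 68000.00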
Demo Preview
Figure 3 shows a preview of the demo for the Zurich tax statement. On the right, we can see multiple detected barcode chunks and the final decoded XML content.
Scalability Discussion
The overall solution is event-based and uses AWS Lambda, Amazon DynamoDB, and Amazon Textract, as shown in Figure 2. We revisit the scaling limitations and concurrency quotas of these services and how they can be overcome for use cases that require higher throughput. Note that even when these quotas are reached, requests are not lost by default; they are kept and processed later, once capacity becomes available.
Amazon DynamoDB Streams is used to trigger a Lambda function when a new file is uploaded for processing. Lambda polls shards in your DynamoDB stream for records at a base rate of 4 times per second. It receives batches of records and forwards them to the Lambda function. If the default settings are not enough, you can increase the throughput by increasing the batch size (up to 10,000 records) and the ParallelizationFactor, as described in the AWS Lambda with Amazon DynamoDB documentation.
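As a sketch, both settings can be adjusted on the existing event source mapping through the AWS SDK; the mapping UUID and the values below are placeholders.

import boto3

lambda_client = boto3.client("lambda")

# Increase the throughput of the DynamoDB stream trigger (UUID is a placeholder)
lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",
    BatchSize=1000,              # up to 10,000 records per batch for DynamoDB streams
    ParallelizationFactor=10,    # up to 10 concurrent batches per shard
)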
AWS Lambda is designed to provide scaling capabilities without any custom engineering in your code. As traffic increases, Lambda increases the number of concurrent executions of your functions. For an initial burst of traffic, the cumulative concurrency in a Region can reach between 500 and 3,000, depending on the Region. After this initial burst, functions can scale by an additional 500 instances per minute. There is also a default concurrency limit of 1,000 per Region for each AWS account, which can be increased by submitting a request in the AWS Support Center.
If more documents arrive at the same time than there is concurrency capacity, they remain in the SQS queue until they get processed.
There is also the possibility to use Provisioned Concurrency if higher concurrency needs to be guaranteed for a particular use case, as sketched below.
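A minimal sketch of enabling Provisioned Concurrency through the AWS SDK; the function name, alias, and value are placeholders.

import boto3

lambda_client = boto3.client("lambda")

# Keep a fixed number of execution environments initialized for the function.
# Provisioned concurrency applies to a published version or alias; the function
# name, alias, and value below are placeholders.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="SyncBarcodeProcessor",
    Qualifier="live",
    ProvisionedConcurrentExecutions=20,
)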
Amazon Textract has service quotas, ranging from 2 to 10 requests per second, depending on the region. If these concurrency quotas are not enough for your particular use case, you can request an increase of these soft limits from the AWS console.
Supported Documents
While our focus was to show how to extract information from Swiss tax and salary statements, the same approach can be utilized for other document types, such as:
- Swiss QR invoices (QR code), introduced by SIX Group
- Swiss and European COVID certificates
- US driver's licenses with a machine-readable PDF417 zone
Ideally, the public authorities behind these documents would publish the defined format of each document as part of an open government strategy, which would promote innovative business creation around these public services.
In Switzerland, at least the following cantons utilize barcodes to encode information for tax declaration and processing: ZH-Zurich, VS-Valais, VD-Vaud, LU-Lucerne, AG-Aargau, TG-Thurgau. This solution can easily be extended to support all of these cantons.
Conclusion
We showed how to extract structured information from various types of Swiss documents utilizing 2D barcodes, and how to embed this barcode processing in a scalable, event-based architecture built on AWS Lambda.
Our extension of the Document Understanding Solution (DUS) for barcodes makes it possible to deploy a full end-to-end solution using the CDK and its CLI commands. The demo allows applying barcode processing to all kinds of documents and can serve as a starting point for increasing automation in document processing workflows.