Converting PayStub to SQL with Azure Form Recognizer

paulriviera

Paul Riviera

Posted on February 7, 2023

Converting PayStub to SQL with Azure Form Recognizer

Why

Like many people I set a series of goals in the beginning of the year. For 2023, one goal was to more accuratly measure aspects of my life, in this case my finances. I have tried using numerous financial tracking apps and I have never been pleased, if anyone has a recommendation please let me know. I decided to take a different approach and try to build my own solution; to start, I am putting the input (income from job) into a sql database so I can being to analyze it. The code outlined here is the first step in that process, you can find the complete codebase on my GitHub.

Begin by provisioning Azure resources

The scripts below prevision the following Azure resources:

NOTE: I choose to use Azure Key Vault to store my Azure Form Recognizer key and endpoint, but you can use any key management method you choose.

Create Azure Form Recognizer with Azure Key Vault

I began with provisioning the Azure Form Recognizer, the code for which is below.

Param(
    [Parameter(Mandatory = $true)]
    [String]
    $ResourceGroupName,
    [Parameter(Mandatory = $true)]
    [String]
    $FormRecognizerName,
    [Parameter(Mandatory = $true)]
    [String]
    $KeyVaultName,
    [Parameter(Mandatory = $true)]
    [String]
    $Location
)

# ------------------------------------------------------------------------------
# Variables
# ------------------------------------------------------------------------------
$RESOURCE_GROUP_NAME = $ResourceGroupName
$FORM_RECOGNIZER_ACCOUNT = $FormRecognizerName
$KEY_VAULT_NAME = $KeyVaultName

# ------------------------------------------------------------------------------
# Provision Resource Group
# ------------------------------------------------------------------------------
az group create `
    --name $RESOURCE_GROUP_NAME `
    --location $LOCATION

# ------------------------------------------------------------------------------
# Provision Azure Key Vault
# ------------------------------------------------------------------------------

az keyvault create `
    --name $KEY_VAULT_NAME `
    --resource-group $RESOURCE_GROUP_NAME `
    --location $LOCATION

# ------------------------------------------------------------------------------
# Provision Azure Form Recognizer
# ------------------------------------------------------------------------------

$FORM_RECOGNIZER_ACCOUNT_ENDPOINT = az cognitiveservices account create `
    --kind "FormRecognizer" `
    --name $FORM_RECOGNIZER_ACCOUNT `
    --resource-group $RESOURCE_GROUP_NAME `
    --location $LOCATION `
    --sku "S0" `
    --assign-identity `
    --yes `
    --query "properties.endpoint" `
    --output tsv

$FORM_RECOGNIZER_ACCOUNT_KEY = az cognitiveservices account keys list `
    --name $FORM_RECOGNIZER_ACCOUNT `
    --resource-group $RESOURCE_GROUP_NAME `
    --query "key1" `
    --output tsv

# ------------------------------------------------------------------------------
# Store Azure Form Recognizer Keys in Vault
# ------------------------------------------------------------------------------

az keyvault secret set `
    --vault-name $KEY_VAULT_NAME `
    --name "FormRecognizerEndpoint" `
    --value $FORM_RECOGNIZER_ACCOUNT_ENDPOINT

az keyvault secret set `
    --vault-name $KEY_VAULT_NAME `
    --name "FormRecognizerKey" `
    --value $FORM_RECOGNIZER_ACCOUNT_KEY
Enter fullscreen mode Exit fullscreen mode

Create Azure SQL Server and Database

Next step is to provision the SQL Database, the script for which is below. If you choose to copy this script and run it in a pipeline, GitHub Actions as an example, you will need to update the external admin on the SQL Server. Similarly the Client IP Range should not be added when run in a pipeline, but rather when you are running the script locally. Feel free to modify the script to suit your needs.

Param(
    [Parameter(Mandatory = $true)]
    [String]
    $ResourceGroupName,
    [Parameter(Mandatory = $true)]
    [String]
    $SqlServerName,
    [Parameter(Mandatory = $true)]
    [String]
    $SqlDatabaseName,
    [Parameter(Mandatory = $true)]
    [String]
    $ClientIPStart,
    [Parameter(Mandatory = $true)]
    [String]
    $ClientIPEnd,
    [Parameter(Mandatory = $true)]
    [String]
    $Location
)

# ------------------------------------------------------------------------------
# Variables
# ------------------------------------------------------------------------------

$RESOURCE_GROUP_NAME = $ResourceGroupName
$LOCATION = $Location

$SQL_SERVER_NAME = $SqlServerName
$SQL_DATABASE_NAME = $SqlDatabaseName

$START_IP = $ClientIPStart
$END_IP = $ClientIPEnd

# ------------------------------------------------------------------------------
# Provision Resource Group
# ------------------------------------------------------------------------------
az group create `
    --name $RESOURCE_GROUP_NAME `
    --location $LOCATION

# ------------------------------------------------------------------------------
# Provision Server (for current signed-in user)
# ------------------------------------------------------------------------------
$SQL_ADMIN_NAME = az ad signed-in-user show `
    --query displayName `
    --output tsv

$SQL_ADMIN_USER_OBJECT_ID = az ad signed-in-user show `
    --query id `
    --output tsv

az sql server create `
    --name $SQL_SERVER_NAME `
    --resource-group $RESOURCE_GROUP_NAME `
    --location $LOCATION `
    --enable-ad-only-auth `
    --external-admin-principal-type User `
    --external-admin-name $SQL_ADMIN_NAME `
    --external-admin-sid $SQL_ADMIN_USER_OBJECT_ID

# ------------------------------------------------------------------------------
# Configure a server-based firewall rule
# ------------------------------------------------------------------------------
az sql server firewall-rule create `
    --resource-group $RESOURCE_GROUP_NAME `
    --server $SQL_SERVER_NAME `
    --name AllowMyIp `
    --start-ip-address $START_IP `
    --end-ip-address $END_IP

# ------------------------------------------------------------------------------
# Create a database
# ------------------------------------------------------------------------------
az sql db create `
    --resource-group $RESOURCE_GROUP_NAME `
    --server $SQL_SERVER_NAME `
    --name $SQL_DATABASE_NAME `
    --edition GeneralPurpose `
    --compute-model Serverless `
    --family Gen5 `
    --capacity 2
Enter fullscreen mode Exit fullscreen mode

Now that resources are provisioned we can begin to build the application.

Building the application.

I chose to build a console app because it will be easy to integrate into my home automation, but in the future I would like to move it to an Azure Function.

lets review the goal again: to pass in a PDF (because thats what I have) and have the application extract the data and insert it into the SQL database.

Setup the project

I created a new console app using the .NET CLI, but you can use Visual Studio or Visual Studio Code to create a new console app.

dotnet new console --name "<Name of your project>"
Enter fullscreen mode Exit fullscreen mode

NOTE: I suggest also adding a .gitignore for dotnet at the repository root as this command will not create one for you.

Add NuGet packages

dotnet add package Azure.Identity
dotnet add package Azure.AI.FormRecognizer
dotnet add package Azure.Security.KeyVault.Secrets
dotnet add package Microsoft.Data.SqlClient
dotnet add package AzFormRecognizer.Table.ToSQL
Enter fullscreen mode Exit fullscreen mode

Before we begin to write code lets talk through the packages.

The Code

I break out the code into 5 core sections to make it easier to follow. To keep things simple, I will simply be placing these sections one after another in the Program.cs file. Feel free to break these sections into functions or different files as you see fit, afterall in its current form unit tests would be painful.

You can remove the Console.WriteLine("Hello, World!"); line as it is not needed.

Section 1: Ensure the PDF file path is passed in as an argument

Add the the following using statement to the top of the file.

using AzFormRecognizer.Table;
Enter fullscreen mode Exit fullscreen mode

This will allow us to use the DocumentDetails class in this section of code.


if (args.Length == 0) {
    throw new Exception("Please pass in the PDF file path.");
}

string filePath = args[0];

if (!File.Exists(filePath)) {
    throw new Exception("File does not exist.");
}

var details = new DocumentDetails() // This is used later as keys for database tables
{ 
    Title = Path.GetFileName(filePath), 
    Id = Guid.NewGuid().ToString()
};

var bytes = await File.ReadAllBytesAsync(filePath);
var memoryStream = new MemoryStream(bytes);

Enter fullscreen mode Exit fullscreen mode

The above code will ensure that a file path is passed in as an argument and that the file exists. It will then read the file into a MemoryStream which will be used later to send the PDF to the Form Recognizer service.

Section 2: Pull Form Recognizer keys from Azure Key Vault

Add the the following using statement to the top of the file.

using Azure.Identity;
using Azure.Security.KeyVault.Secrets;
Enter fullscreen mode Exit fullscreen mode

This will allow us to use the DefaultAzureCredential and SecretClient classes in this section of code.

string? keyVaultName = Environment.GetEnvironmentVariable("KEY_VAULT_NAME");
if (keyVaultName == null || keyVaultName == "") {
    throw new Exception("Please set the KEY_VAULT_NAME environment variable.");
}

var keyVaultUri = new Uri($"https://{keyVaultName}.vault.azure.net");
var secretClient = new SecretClient(keyVaultUri, new DefaultAzureCredential());

var formRecognizerEndpoint = await secretClient.GetSecretAsync("FormRecognizerEndpoint");
var formRecognizerKey = await secretClient.GetSecretAsync("FormRecognizerKey");
Enter fullscreen mode Exit fullscreen mode

The above code will pull the Form Recognizer endpoint and key from Azure Key Vault. The DefaultAzureCredential will use the Azure CLI to authenticate to Azure. If you are not logged into Azure through the CLI you will need to use a different method to authenticate.

Section 3: Send PDF to Azure Form Recognizer

Add the the following using statement to the top of the file.

using Azure;
using Azure.AI.FormRecognizer;
Enter fullscreen mode Exit fullscreen mode

This will allow us to use the AzureKeyCredential, DocumentAnalysisClient, AnalyzeResult and AnalyzeDocumentOperation classes in this section of code.

string endpoint = formRecognizerEndpoint.Value.Value;
string apiKey = formRecognizerKey.Value.Value;

string modelName = "prebuilt-layout";

AzureKeyCredential credential = new AzureKeyCredential(apiKey);
var documentClient = new DocumentAnalysisClient(new Uri(endpoint), credential);  

AnalyzeDocumentOperation operation = await documentClient.AnalyzeDocumentAsync(WaitUntil.Completed, modelName, memoryStream);
AnalyzeResult result = operation.Value;
Enter fullscreen mode Exit fullscreen mode

The above code will send the PDF to the Form Recognizer service and wait for the results to be returned. The modelName variable is the name of the model that will be used to analyze the PDF. In this case we are using the prebuilt-layout model which will return various structural elements of the PDF including extracted tables. You can read more about it in the documentation.

Section 4: Parse the Form Recognizer results into a list of sql commands

Add the the following using statement to the top of the file.

using AzFormRecognizer.Table.ToSQL;
Enter fullscreen mode Exit fullscreen mode

This will allow you to use the IEnumerable<DocumentTable> extension ToSQL which will output the SQL commands.

void AddMissingInfoAndRelations(List<Table> tables, DocumentDetails details)
{
    if (tables.Count == 0)
    {
        throw new Exception("No tables found in the document.");
    }

    if (tables[0].Title == null && tables[0].Headers != null)
    {
        tables[0].Title = "Payslip";
        var payslipTablePrimaryKey = new ColumnHeader() 
        { 
            Name= "DocumentId",
            TableKey = new TableKey() { Type = TableKeyType.Primary },
            DataType = ColumnDataTypes.VARCHAR
        };

        var primaryKeyColumnIndex = tables[0].Headers.Last().Key + 1;
        tables[0].Headers.Add(primaryKeyColumnIndex, payslipTablePrimaryKey);
    }

    var payslipTableForignKey = new ColumnHeader() 
    { 
        Name= "DocumentId",
        TableKey = new TableKey() { Type = TableKeyType.Foreign, Reference = "Payslip(DocumentId)" },
        DataType = ColumnDataTypes.VARCHAR
    };


    if (tables[1].Title == null)
    {
        tables[1].Title = "Summary";
    }

    if (tables[8].Title == null)
    {
        tables[8].Title = "Allowances";
    }

    foreach (var table in tables)
    {
        if (table.Headers.All(header => header.Value.TableKey == null))
        {
            var primaryKey = new ColumnHeader() { Name= "Id", TableKey = new TableKey() { Type = TableKeyType.Primary }, DataType = ColumnDataTypes.INT };
            table.Headers.Add(table.Headers.Last().Key + 1, primaryKey);
            table.Headers.Add(table.Headers.Last().Key + 1, payslipTableForignKey);
        }

        foreach (var row in table.Rows)
        {
            row.Add("DocumentId", details.Id);
        }
    }
}

var listOfSQLCommands = result.Tables.ToSQL(details, AddMissingInfoAndRelations);
Enter fullscreen mode Exit fullscreen mode

The above code will parse the Form Recognizer results into a list of SQL commands. The AddMissingInfoAndRelations method will add the missing table names and primary keys to the tables. While Form Recognizer will return the table headers it will not return the table name or the relation between the tables, which is why the ToSQL method accepts a delegate to add this information. I have this configured to add in missing segments of my own workday generated playstub in PDF format, but you can modify this to suit the needs of the PDF you are using.

Section 5: Execute the SQL commands

Add the the following using statement to the top of the file.

using Microsoft.Data.SqlClient;
Enter fullscreen mode Exit fullscreen mode

This will allow you to use the SqlConnection and SqlCommand classes in this section of code.

string? serverName = Environment.GetEnvironmentVariable("SQL_SERVER_NAME");
if (serverName == null || serverName == "") {
    Console.WriteLine("Please set the SQL_SERVER_NAME environment variable.");
    return;
}

string? databaseName = Environment.GetEnvironmentVariable("SQL_DATABASE_NAME");
if (databaseName == null || databaseName == "") {
    Console.WriteLine("Please set the SQL_DATABASE_NAME environment variable.");
    return;
}

string ConnectionString = $"Server={serverName}.database.windows.net; Authentication=Active Directory Default; Encrypt=True; Database={databaseName}";

using (SqlConnection connection = new SqlConnection(ConnectionString)) {
    await connection.OpenAsync();

    // First create all tables and relations
    // This will cause an error if run after tables exist, this is currently a limitation of the AzFormRecognizer.Table.ToSQL library
    // which will be fixed in a future release, in the meantime, you can comment out the following line and run the program again for new PDFs
    var createTableCmds = listOfSQLCommands.Where(cmd => cmd.Contains("CREATE TABLE")).ToList();
    foreach (var sqlTableStr in createTableCmds) {
        using (var command = new SqlCommand(sqlTableStr, connection)) {
            int rowsAffected = command.ExecuteNonQuery();
            Console.WriteLine(rowsAffected + " = rows affected.");
        }
    }

    // Insert values into tables
    var inserValuesCmds = listOfSQLCommands.Where(cmd => !cmd.Contains("CREATE TABLE")).ToList();
    foreach (var sqlTableStr in inserValuesCmds) {
        using (var command = new SqlCommand(sqlTableStr, connection)) {
            int rowsAffected = command.ExecuteNonQuery();
            Console.WriteLine(rowsAffected + " = rows affected.");
        }
    }

    await connection.CloseAsync();
}
Enter fullscreen mode Exit fullscreen mode

The above code will create the tables and insert the values into the tables. The ConnectionString variable is the connection string to the Azure SQL Database. You can find the connection string in the Azure Portal under the SQL Database resource. The listOfSQLCommands variable is the list of SQL commands that were generated in the previous section.

Section 6: Run the program

To run the program, you can either run it from Visual Studio or from the command line. To run it from the command line, navigate to the directory where the project is located and run the following command.

dotnet build -o output -c release
Enter fullscreen mode Exit fullscreen mode

This will build the project and output the executable to the output directory. To run the executable, run the following command with your own environment variables, project name and pdf.


$Env:SQL_SERVER_NAME = <Enter your SQL Server Name>
$Env:SQL_DATABASE_NAME = <Enter your SQL Database Name>
$Env:KEY_VAULT_NAME = <Enter your Key Vault Name>

./<Name of project>/output/<Name of project> ./Path to PDF.pdf
Enter fullscreen mode Exit fullscreen mode

Limitations

As noted in the comments of the Section 5 code block the SQL commands will fail to create the tables if they already exist. This is a limitation of the AzFormRecognizer.Table.ToSQL library. This will be fixed in a future release. In the meantime, you can comment out the following line and run the program again for new PDFs.

Conclusion

The above functionality is a work in progress as I develop my own personal financial analysis tool. I hope this article has been helpful in showing you how to use Form Recognizer to extract data from PDFs and how to use the data to create SQL commands. If you have any questions or comments, please feel free to reach out.

💖 💪 🙅 🚩
paulriviera
Paul Riviera

Posted on February 7, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related