Cleaning up personal expense data with SQL

p_desch

Patrick Deschere

Posted on September 19, 2024

Cleaning up personal expense data with SQL

In this post, I share my experience practicing SQL by cleaning up data related to personal expense transactions, cleaning the data again (and again), and then finally performing a bit of analysis. I'm still a beginner and hope this article might be interesting to other newbies like myself.

Also! For the sake of writing this post and maintaining a bit of privacy, I made a "demo" copy of my expenses which includes modified dollar amounts and business names. That said... I did not attempt to hide any of my numerous fried chicken purchases 🙈

Goal: Practice SQL with data relevant to my life

Earlier this year, I completed an introductory SQL course that included a lot of exercises related to analyzing ye olde Olympics, Netflix, and FIFA databases. Afterward, I was hungry to continue practicing SQL, but with a set of data more personal and relevant to my life.

Like many people, I routinely spend money on stuff. My previous system for tracking expenses included a horrible Microsoft Excel sheet and too much time spent normalizing data. So I decided to see if SQL would be a better for cleaning, summarizing, and getting insights about my personal expenses.

Plus, I would be able to answer some truly important questions, like, "Which takeout restaurant do I get most frequently?" and "Do I have a fried chicken addiction?"

Part 1: Table design and creation

The first step was reviewing the available data, understanding my goals, and designing a table schema to support those goals.

Bank and credit card data

I downloaded a few CSV files containing bank account and credit card transactions spanning the past two years. These files contained some useful information related to each transaction:

  • Transaction date
  • Amount
  • Description (e.g., TST* Fried Chicken Restaurant LLC)
  • Type (e.g., Credit, Debit, Sale, Return)
  • Category (e.g., Shopping, Groceries, Travel) automatically assigned by the credit card company. This column didn't exist in the files from the bank account.

Table schema planning

Before writing any code, I spent some time thinking about my goals and how I would use the database in the future:

  • Short-term goal: Summarize spending by category, time period (e.g., year, quarter, month), and understand month "fixed" or "essential" spending.
  • Long-term goal: If this project were approved by my wife (a business analyst herself), I might import her expenses into the system and understand how we're sharing household expenses.

Theoretically, these goals could be accomplished with one table. However, in order to retain flexibility (and avoiding bulk revising data later), I included two tables.

 --------------------------------         ----------------------------
| transactions     |             |       | accounts     |             |
| ---------------- | ----------- |       | ------------ | ----------- |
| transaction_id   | primary key |   |‾‾‾| account_id   | primary key | 
| transaction_date |             |   |   | account_name |             |
| account_id       | foreign key |‾‾‾    | account_type |             |
| description      |             |       | institution  |             |
| category         |             |       | owner        |             |
| type             |             |        ----------------------------
| amount           |             |      
 --------------------------------        
Enter fullscreen mode Exit fullscreen mode

The transactions table would store individual records for each incoming and outgoing transaction belonging to all bank and credit card accounts. The second table accounts would store extra information about each account (and perhaps someday would be used to know which expenses belong to accounts owned by me or my wife).

Create the tables!

You can see more details about the tables and fields by referencing the below code. For a newbie like me, the most exciting part was using DECIMAL(10,2) for the amount field, which would accommodate dollar and cent amounts.

CREATE TABLE accounts (
    account_id INTEGER PRIMARY KEY
    account_name VARCHAR(255) NOT NULL
    account_type VARCHAR(50) NOT NULL
    institution VARCHAR(50) NOT NULL
    owner VARCHAR(50) NULL
);

CREATE TABLE transactions (
    transaction_id SERIAL PRIMARY KEY,
    transaction_date DATE NOT NULL,
    account_id INTEGER REFERENCES accounts(account_id) NOT NULL
    description VARCHAR(255) NOT NULL
    category VARCHAR(255) NOT NULL
    type VARCHAR(255) NOT NULL
    amount DECIMAL(10,2) NOT NULL
);
Enter fullscreen mode Exit fullscreen mode

Cleaning data without modifying the data

After creating the tables and importing about 1,500 transactions, I performed a simple query to review the data and get a sense of potential issues to fix.

SELECT
    transaction_date,
    description,
    category,
    type,
    amount
FROM transactions_demo
ORDER BY transaction_date DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode
transaction_date | description                                | category          | type    |   amount   
---------------- | ------------------------------------------ | ----------------- | ------- | -------- 
2024-07-31       | Uniqlo USA LLC                             | Shopping          | Sale    |  -101.08 
2024-07-31       | ETERNAL HOME LOANS   0000000000  WEB ID... | NULL              | DEBIT   | -2348.33 
2024-07-31       | GOOGLE *FI DH0TCM                          | Bills & Utilities | Sale    |  -103.39 
2024-07-30       | Credit Card Payment - Thank You            | NULL              | Payment |  1972.01 
2024-07-30       | PAYMENT TO EMPIRE BANK CARD ENDING IN 0... | NULL              | DEBIT   | -1972.01 
2024-07-30       | TTP* LOVELY DOGGO WALKO                    | Personal          | Sale    |   -20.80 
2024-07-29       | GOOGLE *Peacock                            | Shopping          | Sale    |   -12.35 
2024-07-29       | NEIGHBORS GROCERY OUT                      | Groceries         | Sale    |   -31.57 
2024-07-29       | AMC 0000 ONLINE                            | Entertainment     | Sale    |   -39.78 
2024-07-26       | TAQUERIA EL                                | Food & Drink      | Sale    |   -24.89 
2024-07-25       | SQ *JACK STRANDS SALON                     | Personal          | Sale    |   -56.64 
2024-07-25       | GREYHOUND VETERINARY H                     | Personal          | Sale    |   -87.19 
2024-07-24       | B&H PHOTO 000-000-0000                 | Shopping          | Sale    |   -70.58 
2024-07-24       | EVIL INTERNET & CABLE CORP                 | Bills & Utilities | Sale    |   -77.25 
...              | ...                                        | ...               | ...     |      ... 
Enter fullscreen mode Exit fullscreen mode

As you see, the data required cleaning and standardization before it would be useful for analysis. Here are a few issues:

  • The description column contained inconsistent formats, extra characters (e.g., strings of text related to payment processing services, random invoice numbers), and sometimes the same business appeared differently.
  • There were NULL values in the category field for all bank transactions, because the bank account didn't automatically assign a category like the credit cards accounts did.
  • Also, while the category field was mostly correct, some transactions required a category different from the one assigned by the credit card. For example, I would rather assign dog food, vet bills, and dog walker fees to a new Dog expense category.
  • The data in the type column wasn't necessarily wrong, but I felt it would benefit from some consolidation.
  • There were a lot of transactions that would need to be ignored when summarizing spending habits. For example, transfers between my checking account and savings accounts, transfers to investment accounts, regular credit card payments.

Since I was still exploring the data and wasn't sure how I would use them in the future, I wanted to fix the above issues without modifying the data itself.

Consolidating transaction types

The default transaction types were not very intuitive, so I decided to consolidate them into three categories.

Here are the default types.

SELECT DISTINCT(type) FROM transactions_demo;
Enter fullscreen mode Exit fullscreen mode
type
----------
CREDIT  
Payment     
Adjustment  
CHECK           
DSLIP
Sale
Return
Fee
DEBIT
Enter fullscreen mode Exit fullscreen mode

I wrote the below CASE clause to consolidate them into three types: In, Out, and Return. Technically, returned purchases could be grouped with the other incoming transactions, but for now I wanted an easy to separate them -- just case I wanted to filter say salaries and other income but keep returned purchases.

CASE
    WHEN type ~ 'Adjustment|CREDIT|DSLIP'
        THEN 'In'
    WHEN type ~ 'CHECK|DEBIT|Fee|Payment|Sale'
        THEN 'Out'
    ELSE INITCAP(type) END AS clean_type
Enter fullscreen mode Exit fullscreen mode

Below are the results with the new clean_type added.

SELECT
    description,
    type,
    CASE
        WHEN type ~ 'Adjustment|CREDIT|DSLIP'
            THEN 'In'
        WHEN type ~ 'CHECK|DEBIT|Fee|Payment|Sale'
            THEN 'Out'
        ELSE INITCAP(type) END AS clean_type
FROM clean_transactions
ORDER BY transaction_date DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode
description                          | type   | clean_type
------------------------------------ | ------ | ----------
TTP* LOVELY DOGGO WALKO              | Sale   | Out
BUZZ N WOODY PIZZA                   | Sale   | Out
REGIONAL LANDMARK ENERGY INC         | Sale   | Out
TST*LUCKY BREWING - CAP              | Sale   | Out
ROBBER BARON INVESTMENTS PPD ID: ... | DEBIT  | Out
TTP* LOVELY DOGGO WALKO              | Sale   | Out
GREYHOUND VETERINARY H               | Sale   | Out
NEW TECH NORTHWEST                   | Sale   | Out
REMOTE ONLINE DEPOSIT #              | DSLIP  | In
TST* DIN TAI FUNG - BELLE            | Sale   | Out
ETERNAL HOME LOANS   0000000000      | DEBIT  | Out
GOOGLE *FI DH0TCM                    | Sale   | Out
WALLSTREET BROS SAVINGS: 0000000000  | CREDIT | In
...                                  | ...    | ...
Enter fullscreen mode Exit fullscreen mode

Fixing incorrect and NULL categories

Next on the list was cleaning up the NULL values and wrong categories in the category field. To get a sense of the categories, here are the distinct values assigned by my credit cards.

SELECT DISTINCT(category)
FROM transactions_demo
ORDER BY category;
Enter fullscreen mode Exit fullscreen mode
category
---------------------
Automotive
Bills & Utilities
Education
Entertainment
Fees & Adjustments
Food & Drink
Gas
Gifts & Donations
Groceries
Health & Wellness
Home
Personal
Professional Services
Shopping
Travel
Enter fullscreen mode Exit fullscreen mode

Identifying records with NULL values was simple, but figuring out which transactions would need a different category took a little more time. For those records, I prioritized identifying which recurring transactions needed attention. To do so, I used the below query.

Some vendor names in the description field were not consistent. For example, my mortgage servicing company appeared as ETERNAL HOME LOANS 000... in some places, but also appeared as ETERNAL HOME LOANS PAYMENTS 000... in some places. In order to group similar descriptions like these, I used the LEFT function to only return the first 14 characters of each description.

SELECT 
    COUNT(*),
    LEFT(description,14) AS short_description,
    category
FROM transactions_demo
WHERE amount < 0
GROUP BY short_description, category
ORDER BY count DESC;
Enter fullscreen mode Exit fullscreen mode
count | short_description | category
----- | ------------------| --------
83    | TTP* LOVELY DO    | Personal
43    | PAYMENT TO EMP    | NULL
39    | COSTCO WHSE #0    | Shopping
33    | WALLSTREET BRO    | NULL
31    | CHEWY.COM         | Shopping
29    | NEIGHBORS GROC    | Groceries
27    | APPLE.COM/BILL    | Shopping
26    | ROXY'S FRIED C    | Food & Drink
25    | BUZZ N WOODY P    | Food & Drink
24    | ETERNAL HOME L    | NULL
23    | EVIL INTERNET     | Bills & Utilities
23    | REGIONAL LANDM    | Bills & Utilities
20    | SQ *NEXT DOOR     | Food & Drink
19    | COSTCO GAS #00    | Gas
19    | COOKUNITY INC     | Groceries
15    | SQ *JACK STRAN    | Personal
14    | MCDONALD'S F00    | Food & Drink
14    | GREYHOUND VETE    | Personal
14    | INVESTCORP  BU    | NULL
13    | AVG PARKING 00    | Travel
...   | ...               | ...
Enter fullscreen mode Exit fullscreen mode

After investigating, I compiled a list of issues. Here are a couple:

  • As mentioned previously, transactions associated with my bank account were all NULL and would need a new category assigned. For example, there were 24 transfers to my mortgage servicing company ETERNAL HOME LOANS where the category should be changed from NULL to Mortgage.
  • There were 39 transactions to COSTCO WHSE assigned to Shopping. I wanted to change these to Groceries since the vast majority of my purchases at Costco are food and groceries.
  • Also noted earlier, there were many transactions related to dog ownership, but they were assigned different categories like Personal and Shopping. I wanted to replace the automatically assigned categories with a new Dog category.
  • I wanted to consolidate Automative and Gas into one category Automative & Gas, so it would be easier to group expenses related to owning and driving my car.

Since there wasn't a huge number of recurring vendors and services, I decided the easiest solution would be to use another CASE clause. For example, if the description contained CHEWY.COM, GREYHOUND VETERINARY, or LOVELY DOGGO WALKO, then the category would be replaced with Dog.

After reviewing the most frequently recurring descriptions, I identified several patterns to include in the below CASE statement.

CASE
    WHEN description ~ 'DRIVER LICENSE RENEW|VEHICLE LICENSING|INSUROTRON|PARKING|SPOTHERO|PAYBYPHONE' OR category ~ 'Gas|Automotive' 
        THEN 'Automotive & Gas'
    WHEN description ~ '^PAYMENT TO EMPIRE|Credit Card Payment' 
        THEN 'Credit card'
    WHEN description ~ '^REDEMPTION CREDIT' 
        THEN 'Points cashback'
    WHEN description ~ 'LOVELY DOGGO|GREYHOUND VETERINARY|LOCAL HOUND GROOMER|CHEWY.COM|WHO LET THE PETS OUT' 
        THEN 'Dog'
    WHEN description ~ 'PLAYSTATION|NBA League|NETFLIX|YouTube|NVIDIA|APPLE.COM/BILL|GOOGLE \*Peacock' 
        THEN 'Entertainment'
    WHEN description ~ 'COSTCO WHSE|INSTACART' 
        THEN 'Groceries'
    WHEN description ~ 'TRVL CARD BANK|PARKSRESERVATIONSWEB|TRUSTEDTRAVELER' 
        THEN 'Travel'
    WHEN description ~ 'ATM WITHDRAWAL|FOREIGN EXCHANGE RATE|NON-EMPIRE BANK ATM|^CHECK #' 
        THEN 'Misc'
    WHEN description ~ '^ETERNAL HOME LOANS' 
        THEN 'Mortgage'
    WHEN description ~ '^DEPOSIT  ID NUMBER|^IRS.*TAX REF' 
        THEN 'Other income'
    WHEN description ~ '^CORPO INC ACH PMT|^VENMO|^Zelle|^REMOTE ONLINE DEPOSIT'
        THEN 'Reimbursement'
    WHEN description ~ '^WALLSTREET BROS|^INVESTCORP|ROBBER BARON INVESTM|APA.*TREAS|^Online Transfer' 
        THEN 'Saving & Investing'
    WHEN description ~ '^CORPO INC PAYROLL' 
        THEN 'Wages & Salaries'             
    ELSE category END AS clean_category
Enter fullscreen mode Exit fullscreen mode

After dropping this into my query from last time...

SELECT
    transaction_date,
    description,
    category,
    type,
    CASE
        WHEN type ~ 'Adjustment|Return'
            THEN 'Returns & Cashback'
        WHEN type ~ 'CREDIT|DSLIP'
            THEN 'Income & Deposits'
        WHEN type ~ 'CHECK|DEBIT|Fee|Payment|Sale'
            THEN 'Payments & Transfers'
        ELSE INITCAP(type) END AS clean_type,
    CASE
    WHEN description ~ 'DRIVER LICENSE RENEW|VEHICLE LICENSING|INSUROTRON|PARKING|SPOTHERO|PAYBYPHONE' OR category ~ 'Gas|Automotive' 
        THEN 'Automotive & Gas'
    WHEN description ~ '^PAYMENT TO EMPIRE|Credit Card Payment' 
        THEN 'Credit card'
    WHEN description ~ '^REDEMPTION CREDIT' 
        THEN 'Points cashback'
    WHEN description ~ 'LOVELY DOGGO|GREYHOUND VETERINARY|LOCAL HOUND GROOMER|CHEWY.COM|WHO LET THE PETS OUT' 
        THEN 'Dog'
    WHEN description ~ 'PLAYSTATION|NBA League|NETFLIX|YouTube|NVIDIA|APPLE.COM/BILL|GOOGLE \*Peacock' 
        THEN 'Entertainment'
    WHEN description ~ 'COSTCO WHSE|INSTACART' 
        THEN 'Groceries'
    WHEN description ~ 'TRVL CARD BANK|PARKSRESERVATIONSWEB|TRUSTEDTRAVELER' 
        THEN 'Travel'
    WHEN description ~ 'ATM WITHDRAWAL|FOREIGN EXCHANGE RATE|NON-EMPIRE BANK ATM|^CHECK #' 
        THEN 'Misc'
    WHEN description ~ '^ETERNAL HOME LOANS' 
        THEN 'Mortgage'
    WHEN description ~ '^DEPOSIT  ID NUMBER|^IRS.*TAX REF' 
        THEN 'Other income'
    WHEN description ~ '^CORPO INC ACH PMT|^VENMO|^Zelle|^REMOTE ONLINE DEPOSIT'
        THEN 'Reimbursement'
    WHEN description ~ '^WALLSTREET BROS|^INVESTCORP|ROBBER BARON INVESTM|APA.*TREAS|^Online Transfer' 
        THEN 'Saving & Investing'
    WHEN description ~ '^CORPO INC PAYROLL' 
        THEN 'Wages & Salaries'             
    ELSE category END AS clean_category,
    amount
FROM transactions_demo
WHERE transaction_date <= '2024-08-06' 
AND description !~ 'NEW TECH NORTHWEST'
ORDER BY transaction_date DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode

...I got the below example of the clean_category column.

description                               | category          | clean_category
----------------------------------------- | ----------------- | ---------------
ROBBER BARON INVESTMENTS PPD ID: 00000... | NULL              | Saving & Investing
GREYHOUND VETERINARY H                    | Personal          | Dog
TTP* LOVELY DOGGO WALKO                   | Personal          | Dog
REMOTE ONLINE DEPOSIT #          0        | NULL              | Reimbursement
TST* DIN TAI FUNG - BELLE                 | Food & Drink      | Food & Drink
Uniqlo USA LLC                            | Shopping          | Shopping
ETERNAL HOME LOANS   0000000000      W... | NULL              | Mortgage
GOOGLE *FI DH0TCM                         | Bills & Utilities | Bills & Utilities
TTP* LOVELY DOGGO WALKO                   | Personal          | Dog
Credit Card Payment - Thank You           | NULL              | Credit card
PAYMENT TO EMPIRE BANK CARD ENDING IN ... | NULL              | Credit card
GOOGLE *Peacock                           | Shopping          | Entertainment
AMC 0000 ONLINE                           | Entertainment     | Entertainment
NEIGHBORS GROCERY OUT                     | Groceries         | Groceries
WALLSTREET BROS SAVINGS: 0000000000       | NULL              | Saving & Investing
TAQUERIA EL                               | Food & Drink      | Food & Drink
GREYHOUND VETERINARY H                    | Personal          | Dog
SQ *JACK STRANDS SALON                    | Personal          | Personal
FANDANGO                                  | Entertainment     | Entertainment
TRVL CARD BANK  ONLINE PMT CKF00000000... | NULL              | Travel
Enter fullscreen mode Exit fullscreen mode

Removing chaos from the descriptions

The description field includes a note about each transaction. For example, it could be the vendor from which I purchased something or the name of a savings account to which I sent money.

The descriptions had numerous formatting inconsistencies and included extra characters that made it difficult to group them together. For example, each transaction related to Google Fi (my phone service) included a unique invoice code. Take a look below.

SELECT 
    transaction_date,
    description
FROM transactions_demo
WHERE description ~* '^Google.+Fi'
ORDER BY transaction_date DESC
LIMIT 25;
Enter fullscreen mode Exit fullscreen mode
transaction_date | description       
---------------- | ----------------- 
2024-07-31       | GOOGLE *FI DH0TCM 
2024-07-01       | GOOGLE *FI M0zWRQ 
2024-05-31       | GOOGLE *FI 00CNX0 
2024-05-01       | GOOGLE *FI 0GB00X 
2024-03-31       | GOOGLE *FI 0S0HNC 
2024-03-02       | GOOGLE *FI XKW00N 
2024-01-31       | GOOGLE *FI D0J0RZ 
2023-12-31       | GOOGLE *FI G0JRTL 
2023-12-01       | GOOGLE *FI L0CQ00 
2023-10-31       | GOOGLE *FI 0HFBR0 
2023-10-01       | GOOGLE *FI vvRRSB 
2023-08-31       | GOOGLE *FI CV00MN 
2023-07-31       | GOOGLE *FI 00LRXC 
...              | ...               

Enter fullscreen mode Exit fullscreen mode

If I wanted to answer a question like "How much did I spend on phone service last year," standardizing the description to something like Google Fi would make it easier to group the transactions.

Next, this is mainly an aesthetic problem, but take a look at dining and takeout transactions.

SELECT
    description,
    category
FROM transactions_demo
WHERE category = 'Food & Drink'
ORDER BY transaction_date DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode
'description'             | 'category'   
------------------------- | ------------ 
MESOPOTAMIA MIX           | Food & Drink 
TST*LUCKY BREWING - CAP   | Food & Drink 
HOLE IN WALL THAI LLC     | Food & Drink 
TST*ROCKET TACO - TOWN    | Food & Drink 
CAFE BRIO                 | Food & Drink 
SQ *RED FISH BLUE FISH    | Food & Drink 
SQ *CRAFT (VICTORIA) LTD. | Food & Drink 
FRIENDS &amp; FAMILY BAKE | Food & Drink 
BUZZ N WOODY PIZZA        | Food & Drink 
TST* DIN TAI FUNG         | Food & Drink 
TAQUERIA EL               | Food & Drink 
TWSP PRK ADAM GOLF RES0   | Food & Drink 
SQ *FATS WINGS AND WAFFLE | Food & Drink 
SQ *BEANFISH              | Food & Drink 
SQ *FATS WINGS AND WAFFLE | Food & Drink 
TST* RETURNER CINEMA      | Food & Drink 
...                       | ...          
Enter fullscreen mode Exit fullscreen mode

Restaurants and many other businesses use payment processing services like Square or Toast, which insert a short string like SQ * and TST* at the beginning of each description. I wanted to remove these extra strings (and other characters) to enhance readability of the data. Like this:

  • SQ *RED FISH BLUE FISH --> RED FISH BLUE FISH
  • SQ *CRAFT (VICTORIA) LTD. --> CRAFT VICTORIA
  • TST*LUCKY BREWING - CAP --> LUCKY BREWING - CAP

Next I checked descriptions that included a few special characters:

SELECT
    description,
    category
FROM transactions_demo
WHERE description ~ '\*|\#|\-' 
ORDER BY transaction_date DESC
LIMIT 25;
Enter fullscreen mode Exit fullscreen mode
description                        | category          
---------------------------------- | ----------------- 
00000 - 000 DEXTER JUNO            | Travel            
COSTCO WHSE #0000                  | Shopping          
REMOTE ONLINE DEPOSIT #     000000 | NULL              
GOOGLE *Peacock                    | Shopping          
B\&amp;H PHOTO 000-000-0000        | Shopping          
AMAZON MKTPL*RJ0DG00F0             | Shopping          
LOWES #00000*                      | Home              
COSTCO GAS #0000                   | Gas               

Enter fullscreen mode Exit fullscreen mode

Based on that query, I wanted to remove extra characters (and numbers) like these:

  • COSTCO WHSE #0000 --> COSTCO WHSE
  • COSTCO GAS #0000 --> COSTCO GAS
  • LOWES #00000* --> LOWES

And fix special characters like the ampersand in this description:

  • B&amp;H PHOTO --> B&H PHOTO

Finally, there were many transactions that included a mix of letters, numbers, and special characters at the end of the string. Look at the utter mess below.

SELECT description
FROM transactions_demo
WHERE description ~* '\d+'
ORDER BY transaction_date DESC
LIMIT 25;
Enter fullscreen mode Exit fullscreen mode
description     
--------------------------------------------------------------
ROBBER BARON INVESTMENTS PPD ID: 0000000000
ETERNAL HOME LOANS   0000000000      WEB ID: 0000000000
PAYMENT TO EMPIRE BANK CARD ENDING IN 0000 00/00
WALLSTREET BROS SAVINGS: 0000000000
TRVL CARD BANK  ONLINE PMT CKF000000000POS WEB ID: 0000000000
APA  TREAS 000     MISC PAY                 PPD ID: 0000000000
APA  TREAS 000     MISC PAY                 PPD ID: 0000000000
...
...
Enter fullscreen mode Exit fullscreen mode

In these situations, I wanted to change something like ETERNAL HOME LOANS 0000000000 WEB ID: 0000000000 to just ETERNAL HOME LOANS.

Regular expressions to the rescue

By exploring the description column and reviewing the most frequently occurring records, I identified several patterns that should be deleted or fixed. From there, I wrote a a series of nested REPLACE and REGEXP_REPLACE clauses that would clean up the descriptions.

-- Change first letter to uppercase, rest of string lower case
TRIM(INITCAP(
    REGEXP_REPLACE(
        REGEXP_REPLACE(
            REGEXP_REPLACE(
                REGEXP_REPLACE(
                    REPLACE(
                        REPLACE(description,'''',''),
                    -- Fix ampersands (&)
                    '&amp;', '&'),
                -- Replace strings of letters and numbers after a special character (as long as it appears after the eighth char)
                '(?<!^.{0,8})\W\d+.+$',' ''g'),
            -- Replace certain patterns (e.g., 'TST*'), special characters, unwanted strings(e.g. 'LLC')
            '^SP\s|^[a-zA-Z]{2,3}\*|^[a-zA-Z]{2,3} \*|\s[a-zA-Z]{1,3}\d+.+$|\s[a-zA-Z]{3}\s[a-zA-Z]{2}\:\s.|\d|[^\w\s&.]|\sLLC$|\sINC$|\sLTD$|\sLLC\.$|\sINC\.$|\sInc\.$|\sLTD\.$', ' ', 'g'),
        -- Replace single characters hanging at the end of a string (e.g. 'Veterinary H' --> 'Veterinary')
        '\s[a-zA-Z]$|\s[a-zA-Z]\s+$','', 'g'),  
    -- Change multiple spaces into one space
    '\s+', ' ', 'g')
)) AS clean_description
Enter fullscreen mode Exit fullscreen mode

I assume someone with more experience would be able to write a more concise version, but I figured it was good enough for me. Let's take a look...

SELECT 
    transaction_date,
    description,
    TRIM(INITCAP(
        REGEXP_REPLACE(
            REGEXP_REPLACE(
                REGEXP_REPLACE(
                    REGEXP_REPLACE(
                        REPLACE(
                            REPLACE(description,'''',''),
                        '&amp;', '&'),
                    '(?<!^.{0,8})\W\d+.+$',' ''g'),
                '^SP\s|^[a-zA-Z]{2,3}\*|^[a-zA-Z]{2,3} \*|\s[a-zA-Z]{1,3}\d+.+$|\s[a-zA-Z]{3}\s[a-zA-Z]{2}\:\s.|\d|[^\w\s&.]|\sLLC$|\sINC$|\sLTD$|\sLLC\.$|\sINC\.$|\sInc\.$|\sLTD\.$', ' ', 'g'),
            '\s[a-zA-Z]$|\s[a-zA-Z]\s+$','', 'g'),  
        '\s+', ' ', 'g')
        )) AS clean_description
FROM transactions_demo
ORDER BY transaction_date DESC
LIMIT 45;
Enter fullscreen mode Exit fullscreen mode
'description'                            | 'clean_description'                   
---------------------------------------- | ------------------------------------- 
TTP* LOVELY DOGGO WALKO                  | Lovely Doggo Walko                    
TST*LUCKY BREWING - CAP                  | Lucky Brewing Cap                     
00000 - 000 DEXTER JUNO                  | Dexter Juno                           
TTP* LOVELY DOGGO WALKO                  | Lovely Doggo Walko                    
EVIL INTERNET & CABLE CORP               | Evil Internet & Cable Corp            
HOLE IN WALL THAI LLC                    | Hole In Wall Thai                     
COSTCO WHSE #0000                        | Costco Whse                           
COAST VICTORIA HOTE &amp;                | Coast Victoria Hote &                 
PY *FRS CLIPPER                          | Frs Clipper                           
CHEWY.COM                                | Chewy.Com                             
SQ *CRAFT (VICTORIA) LTD.                | Craft Victoria                        
IRISH TIMES PUB                          | Irish Times Pub                       
FRIENDS &amp; FAMILY BAKE                | Friends & Family Bake                 
SQ *RED FISH BLUE FISH                   | Red Fish Blue Fish                    
ROBBER BARON INVESTMENTS PPD ID: 0000... | Robber Baron Investments              
TTP* LOVELY DOGGO WALKO                  | Lovely Doggo Walko                    
GREYHOUND VETERINARY H                   | Greyhound Veterinary                  
REMOTE ONLINE DEPOSIT #          0       | Remote Online Deposit                 
ETERNAL HOME LOANS  0000000  WEB ID: ... | Eternal Home Loans                    
GOOGLE *FI DH0TCM                        | Google Fi                             
Uniqlo USA LLC                           | Uniqlo Usa                            
TTP* LOVELY DOGGO WALKO                  | Lovely Doggo Walko                    
PAYMENT TO EMPIRE BANK CARD ENDING IN... | Payment To Empire Bank Card Ending In 
...                                      | ...                                   

Enter fullscreen mode Exit fullscreen mode

...And it did a pretty GOOD job!!!!

However, it wasn't perfect. For example, let's look at purchases from Amazon.

description            | clean_description  
---------------------- | ------------------ 
AMAZON MKTPL*RJ0DG00F0 | Amazon Mktpl Rj Dg 
AMZN Mktp US*R00000RQ0 | Amzn Mktp Us R Rq  
AMAZON MKTPL*000GV0A00 | Amazon Mktpl       
AMZN Mktp US*BZ0E00T00 | Amzn Mktp Us Bz E  
Amazon.com             | Amazon.Com         
AMZN Mktp US*DK00L00F0 | Amzn Mktp Us Dk L  
Amazon.com*R00Y00R00   | Amazon.Com R Y     
...
Enter fullscreen mode Exit fullscreen mode

For recurring vendors and services, I decided it would be faster to add some specialized rules to the CASE statement. Back when I was fixing the category field earlier, I already made a list of frequently recurring vendors and patterns in the description field, so it was pretty easy to create another clause as below.

TRIM(CASE
    WHEN description ~ 'PAYBYPHONE'
        THEN 'PayByPhone Parking'
    WHEN description ~* 'GOOGLE.*FI'
        THEN 'Google Fi'
    WHEN description ~ 'LOVELY DOGGO WALKO'
        THEN 'Lovely Doggo Walko'
    WHEN description ~* 'PLAYSTATION'
        THEN 'PlayStation Network'
    WHEN description ~ 'SOUTHERN CHICKEN AND WAFF'
        THEN 'Southern Chicken And Waffle'
    WHEN description ~ 'NEIGHBORS GROCERY OUT'
        THEN 'Neighbors Grocery Outlet'
    WHEN description ~ 'AmazonStores|Amazon Fresh'
        THEN 'Amazon Fresh'
    WHEN description ~ 'WALLSTREET BROS'
        THEN 'Wallstreet Bros Savings'
    WHEN description ~* 'AMZN Mktp|Amazon.com|AMAZON MKTPL'
        THEN 'Amazon.com'
    WHEN description ~ 'ATM WITHDRAWAL'
        THEN 'ATM Withdrawal'
    WHEN description ~ 'ETERNAL HOME LOANS'
        THEN 'Eternal Home Loans'
    WHEN description ~ 'AIRBNB'
        THEN 'Airbnb'
    WHEN description ~ 'LYFT.*RIDE'
        THEN 'Lyft Ride'
    WHEN description ~* 'Uniqlo'
        THEN 'Uniqlo'
    ELSE INITCAP(
        REGEXP_REPLACE(
            REGEXP_REPLACE(
                REGEXP_REPLACE(
                    REGEXP_REPLACE(
                        REPLACE(
                            REPLACE(description,'''',''),
                        '&amp;', '&'),
                    '(?<!^.{0,8})\W\d+.+$',' ''g'),
                '^SP\s|^[a-zA-Z]{2,3}\*|^[a-zA-Z]{2,3} \*|\s[a-zA-Z]{1,3}\d+.+$|\s[a-zA-Z]{3}\s[a-zA-Z]{2}\:\s.|\d|[^\w\s&.]|\sLLC$|\sINC$|\sLTD$|\sLLC\.$|\sINC\.$|\sInc\.$|\sLTD\.$', ' ', 'g'),
            '\s[a-zA-Z]$|\s[a-zA-Z]\s+$','', 'g'),  
        '\s+', ' ', 'g')
    ) END) AS clean_description
Enter fullscreen mode Exit fullscreen mode

Move the query to a common table expression

At this point, the new clean_description, clean_category, and clean_type columns fixed most of the inconsistent and missing data. It was time to start summarizing and analyzing my expenses! But first... let's examine one more important step.

When I tried to filter transactions using a WHERE clause with the clean_category field, as you can see below...

SELECT
    transaction_date,
    description,
    CASE
        WHEN description ~ 'DRIVER LICENSE RENEW|VEHICLE LICENSING|INSUROTRON|PARKING|SPOTHERO|PAYBYPHONE' OR category ~ 'Gas|Automotive' 
            THEN 'Automotive & Gas'
        WHEN description ~ '^PAYMENT TO EMPIRE|Credit Card Payment' 
            THEN 'Credit card'
        WHEN description ~ '^REDEMPTION CREDIT' 
            THEN 'Points cashback'
        WHEN description ~ 'LOVELY DOGGO|GREYHOUND VETERINARY|LOCAL HOUND GROOMER|CHEWY.COM|WHO LET THE PETS OUT' 
            THEN 'Dog'
        WHEN description ~ 'PLAYSTATION|NBA League|NETFLIX|YouTube|NVIDIA|APPLE.COM/BILL|GOOGLE \*Peacock' 
            THEN 'Entertainment'
        WHEN description ~ 'PHO SUP SHOP|BLING BLING ASIAN MARKET' 
            THEN 'Food & Drink' 
        WHEN description ~ 'COSTCO WHSE|INSTACART' 
            THEN 'Groceries'
        WHEN description ~ 'TRVL CARD BANK|PARKSRESERVATIONSWEB|TRUSTEDTRAVELER' 
            THEN 'Travel'
        WHEN description ~ 'ATM WITHDRAWAL|FOREIGN EXCHANGE RATE|NON-EMPIRE BANK ATM|^CHECK #' 
            THEN 'Misc'
        WHEN description ~ '^ETERNAL HOME LOANS' 
            THEN 'Mortgage'
        WHEN description ~ '^DEPOSIT  ID NUMBER|^IRS.*TAX REF' 
            THEN 'Other income'
        WHEN description ~ '^CORPO INC ACH PMT|^VENMO|^Zelle|^REMOTE ONLINE DEPOSIT'
            THEN 'Reimbursement'
        WHEN description ~ '^WALLSTREET BROS|^INVESTCORP|ROBBER BARON INVESTM|APA.*TREAS|^Online Transfer' 
            THEN 'Saving & Investing'
        WHEN description ~ '^CORPO INC PAYROLL' 
            THEN 'Wages & Salaries'             
        ELSE category END AS clean_category,
    amount
FROM transactions_demo
-- Only show results from the Food & Drink category
WHERE clean_category = 'Food & Drink'
Enter fullscreen mode Exit fullscreen mode

...I encountered this error.

ERROR:  column "clean_category" does not exist
LINE 98: WHERE clean_category = 'Food & Drink'
               ^ 
Enter fullscreen mode Exit fullscreen mode

Why? Well, when resolving a query, SQL checks the content in the WHERE clause before resolving the SELECT portion of the statement (makes sense because we only want data matching the WHERE clause to be processed and returned). Since the CASE clause is processed along with the rest of the SELECT section, it doesn't exist when WHERE is resolved. So my clean_category column didn't exist when the WHERE clause was resolved.

To fix this issue, I moved the query to a CTE (common table expression). That way, my "clean" columns would be stored in something like a temporary table, and I would be able to reference the cleaned data when writing a WHERE clause. Here is an example of the query after moving it to a CTE.

WITH 
    -- Common table expression (CTE) containing my cleaned data 
    clean_transactions AS (
        SELECT
            transaction_id,
            transaction_date,
            -- Replace common descriptions and remove extra characters and symbols
            TRIM(CASE
                WHEN description ~ 'PAYBYPHONE'
                    THEN 'PayByPhone Parking'
                WHEN description ~* 'GOOGLE.*FI'
                    THEN 'Google Fi'
                WHEN description ~ 'LOVELY DOGGO WALKO'
                    THEN 'Lovely Doggo Walko'
                WHEN description ~* 'PLAYSTATION'
                    THEN 'PlayStation Network'
                WHEN description ~ 'SOUTHERN CHICKEN AND WAFF'
                    THEN 'Southern Chicken And Waffle'
                WHEN description ~ 'NEIGHBORS GROCERY OUT'
                    THEN 'Neighbors Grocery Outlet'
                WHEN description ~ 'AmazonStores|Amazon Fresh'
                    THEN 'Amazon Fresh'
                WHEN description ~ 'WALLSTREET BROS'
                    THEN 'Wallstreet Bros Savings'
                WHEN description ~* 'AMZN Mktp|Amazon.com|AMAZON MKTPL'
                    THEN 'Amazon.com'
                WHEN description ~ 'ATM WITHDRAWAL'
                    THEN 'ATM Withdrawal'
                WHEN description ~ 'ETERNAL HOME LOANS'
                    THEN 'Eternal Home Loans'
                WHEN description ~ 'AIRBNB'
                    THEN 'Airbnb'
                WHEN description ~ 'LYFT.*RIDE'
                    THEN 'Lyft Ride'
                WHEN description ~* 'Uniqlo'
                    THEN 'Uniqlo'
                ELSE INITCAP(
                    REGEXP_REPLACE(
                        REGEXP_REPLACE(
                            REGEXP_REPLACE(
                                REGEXP_REPLACE(
                                    REPLACE(
                                        REPLACE(description,'''',''),
                                    '&amp;', '&'),
                                -- Replace strings of letters and numbers after a special character (as long as it appears after the eighth char)
                                '(?<!^.{0,8})\W\d+.+$',' ''g'),
                            -- Replace certain patterns (e.g., 'TST*'), special characters, unwanted strings(e.g. 'LLC')
                            '^SP\s|^[a-zA-Z]{2,3}\*|^[a-zA-Z]{2,3} \*|\s[a-zA-Z]{1,3}\d+.+$|\s[a-zA-Z]{3}\s[a-zA-Z]{2}\:\s.|\d|[^\w\s&.]|\sLLC$|\sINC$|\sLTD$|\sLLC\.$|\sINC\.$|\sInc\.$|\sLTD\.$', ' ', 'g'),
                        -- Replace single characters hanging at the end of a string (e.g. 'Veterinary H' --> 'Veterinary')
                        '\s[a-zA-Z]$|\s[a-zA-Z]\s+$','', 'g'),  
                    -- Change multiple spaces into one space
                    '\s+', ' ', 'g')
                ) END) AS clean_description,
            -- Fix wrong categories and add new ones (e.g., Dog, Mortgage, Wages & Salary)
            CASE
                WHEN description ~ 'DRIVER LICENSE RENEW|VEHICLE LICENSING|INSUROTRON|PARKING|SPOTHERO|PAYBYPHONE' OR category ~ 'Gas|Automotive' 
                    THEN 'Automotive & Gas'
                WHEN description ~ '^PAYMENT TO EMPIRE|Credit Card Payment' 
                    THEN 'Credit card'
                WHEN description ~ '^REDEMPTION CREDIT' 
                    THEN 'Points cashback'
                WHEN description ~ 'LOVELY DOGGO|GREYHOUND VETERINARY|LOCAL HOUND GROOMER|CHEWY.COM|WHO LET THE PETS OUT' 
                    THEN 'Dog'
                WHEN description ~ 'PLAYSTATION|NBA League|NETFLIX|YouTube|NVIDIA|APPLE.COM/BILL|GOOGLE \*Peacock' 
                    THEN 'Entertainment'
                WHEN description ~ 'COSTCO WHSE|INSTACART' 
                    THEN 'Groceries'
                WHEN description ~ 'TRVL CARD BANK|PARKSRESERVATIONSWEB|TRUSTEDTRAVELER' 
                    THEN 'Travel'
                WHEN description ~ 'ATM WITHDRAWAL|FOREIGN EXCHANGE RATE|NON-EMPIRE BANK ATM|^CHECK #' 
                    THEN 'Misc'
                WHEN description ~ '^ETERNAL HOME LOANS' 
                    THEN 'Mortgage'
                WHEN description ~ '^DEPOSIT  ID NUMBER|^IRS.*TAX REF' 
                    THEN 'Other income'
                WHEN description ~ '^CORPO INC ACH PMT|^VENMO|^Zelle|^REMOTE ONLINE DEPOSIT'
                    THEN 'Reimbursement'
                WHEN description ~ '^WALLSTREET BROS|^INVESTCORP|ROBBER BARON INVESTM|APA.*TREAS|^Online Transfer' 
                    THEN 'Saving & Investing'
                WHEN description ~ '^CORPO INC PAYROLL' 
                    THEN 'Wages & Salaries'             
                ELSE category END AS clean_category,
            -- Group transaction type to simplify grouping later
            CASE
                WHEN type ~ 'Adjustment|Return'
                    THEN 'Returns & Cashback'
                WHEN type ~ 'CREDIT|DSLIP'
                    THEN 'Income & Deposits'
                WHEN type ~ 'CHECK|DEBIT|Fee|Payment|Sale'
                    THEN 'Payments & Transfers'
                ELSE INITCAP(type) END AS clean_type,
            amount,
            -- Add year, quarter, and month for summarizing transactions later
            EXTRACT(YEAR FROM transaction_date) AS year,
            TO_CHAR(transaction_date,'YYYY"_Q"Q') as qtr_string,
            TO_CHAR(transaction_date,'YYYY_MM') as month,
            account_id
        FROM transactions_demo
    )
-- Query referencing the new 'clean_transactions' CTE
SELECT
    transaction_date,
    clean_description,
    amount
FROM clean_transactions
WHERE clean_category = 'Food & Drink'
ORDER BY transaction_date DESC;
Enter fullscreen mode Exit fullscreen mode
transaction_date | clean_description     |  amount  
---------------- | --------------------- | ------- 
2024-08-21       | Buzz N Woody Pizza    |  -50.33  
2024-08-17       | Lucky Brewing Cap     |   -7.42      
2024-08-17       | Mesopotamia Mix       |  -29.06  
2024-08-13       | Hole In Wall Thai     |  -52.50  
2024-08-12       | Rocket Taco Town      |  -33.59  
2024-08-11       | Coast Victoria Hote & |  -42.62  
2024-08-10       | Cafe Brio             | -121.86  
2024-08-09       | Friends & Family Bake |   -7.90      
2024-08-09       | Red Fish Blue Fish    |  -32.86  
2024-08-09       | Irish Times Pub       |  -74.75  
2024-08-09       | Craft Victoria        |  -31.78  
2024-08-03       | Din Tai Fung Belle    |  -40.10  
2024-07-26       | Taqueria El           |  -24.89  
...              | ...                   |     ...       

Enter fullscreen mode Exit fullscreen mode

Yes!!! With the clean_transactions CTE, I would be able to finally start summarizing and analyzing the transactions.

A couple quick queries

To start, let's look at the total spend per category along with the percentage of total spend that year.

WITH
    ...
    ...
    ...
SELECT 
    year,
    clean_category,
    SUM(amount) AS total_spend,
    ROUND(SUM(amount) / (SUM(SUM(amount)) OVER(PARTITION BY year))*100,2)  AS pct_of_spend
FROM clean_transactions
WHERE clean_category NOT IN ('Wages & Salaries','Credit card','Saving & Investing','Other income','Points cashback')
GROUP BY year, clean_category
ORDER BY year DESC, total_spend;
Enter fullscreen mode Exit fullscreen mode
year | clean_category        | total_spend | pct_of_spend 
---- | --------------------- | ----------- | ------------ 
2024 | Mortgage              |   -16196.71 |        38.78 
2024 | Shopping              |    -5309.93 |        12.71 
2024 | Travel                |    -4350.83 |        10.42 
2024 | Dog                   |    -3255.04 |         7.79 
2024 | Bills & Utilities     |    -3208.90 |         7.68 
2024 | Food & Drink          |    -3021.83 |         7.24 
2024 | Groceries             |    -1830.94 |         4.38 
2024 | Automotive & Gas      |    -1446.29 |         3.46 
2024 | Misc                  |     -760.00 |         1.82 
2024 | Home                  |     -715.45 |         1.71 
2024 | Entertainment         |     -687.91 |         1.65 
2024 | Personal              |     -543.92 |         1.30 
2024 | Professional Services |     -393.02 |         0.94 
2024 | Health & Wellness     |     -170.78 |         0.41 
2024 | Gifts & Donations     |      -76.00 |         0.18 
...  | ...                   |         ... |          ... 

Enter fullscreen mode Exit fullscreen mode

Or, just for fun, I could check my most frequented restaurants in 2024.

WITH
    ...
    ...
    ...
SELECT 
    COUNT(clean_description) AS visits,
    clean_description AS restaurant
FROM clean_transactions 
WHERE clean_category IN('Food & Drink')
    AND year = '2024'
GROUP BY clean_description
ORDER BY visits DESC;

Enter fullscreen mode Exit fullscreen mode
visits | restaurant                 
------ | ---------------------------
11     | Roxys Fried Chicken        
6      | Buzz N Woody Pizza         
4      | Mesopotamia Mix            
4      | Mcdonalds                  
3      | Southern Chicken And Waffle
2      | Fantasy Donut Force        
2      | Taqueria El                
2      | Twsp Prk Adam Golf Res     
2      | Taco Tuesday               
2      | Next Door Bakery           
...    | ...
Enter fullscreen mode Exit fullscreen mode

...So, yes, there might be a fried chicken problem in this household.

Thank you for reading!

I hope my experience has provided ideas for your own personal coding or expense tracking projects. In my next post, I will explore my "essential" and "non-essential" expenses, find large purchases that exceed the average transaction amount, and summarize quarterly spending. In the meantime, if you have any questions or suggestions please let me know.

If you made it this far, I have a question for you: "What kinds of personal projects did you do when you first started learning SQL?"

💖 💪 🙅 🚩
p_desch
Patrick Deschere

Posted on September 19, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related