Patrick Deschere
Posted on September 19, 2024
In this post, I share my experience practicing SQL by cleaning up data related to personal expense transactions, cleaning the data again (and again), and then finally performing a bit of analysis. I'm still a beginner and hope this article might be interesting to other newbies like myself.
Also! For the sake of writing this post and maintaining a bit of privacy, I made a "demo" copy of my expenses which includes modified dollar amounts and business names. That said... I did not attempt to hide any of my numerous fried chicken purchases 🙈
Goal: Practice SQL with data relevant to my life
Earlier this year, I completed an introductory SQL course that included a lot of exercises related to analyzing ye olde Olympics, Netflix, and FIFA databases. Afterward, I was hungry to continue practicing SQL, but with a set of data more personal and relevant to my life.
Like many people, I routinely spend money on stuff. My previous system for tracking expenses included a horrible Microsoft Excel sheet and too much time spent normalizing data. So I decided to see if SQL would be a better for cleaning, summarizing, and getting insights about my personal expenses.
Plus, I would be able to answer some truly important questions, like, "Which takeout restaurant do I get most frequently?" and "Do I have a fried chicken addiction?"
Part 1: Table design and creation
The first step was reviewing the available data, understanding my goals, and designing a table schema to support those goals.
Bank and credit card data
I downloaded a few CSV files containing bank account and credit card transactions spanning the past two years. These files contained some useful information related to each transaction:
- Transaction date
- Amount
- Description (e.g., TST* Fried Chicken Restaurant LLC)
- Type (e.g., Credit, Debit, Sale, Return)
- Category (e.g., Shopping, Groceries, Travel) automatically assigned by the credit card company. This column didn't exist in the files from the bank account.
Table schema planning
Before writing any code, I spent some time thinking about my goals and how I would use the database in the future:
- Short-term goal: Summarize spending by category, time period (e.g., year, quarter, month), and understand month "fixed" or "essential" spending.
- Long-term goal: If this project were approved by my wife (a business analyst herself), I might import her expenses into the system and understand how we're sharing household expenses.
Theoretically, these goals could be accomplished with one table. However, in order to retain flexibility (and avoiding bulk revising data later), I included two tables.
-------------------------------- ----------------------------
| transactions | | | accounts | |
| ---------------- | ----------- | | ------------ | ----------- |
| transaction_id | primary key | |‾‾‾| account_id | primary key |
| transaction_date | | | | account_name | |
| account_id | foreign key |‾‾‾ | account_type | |
| description | | | institution | |
| category | | | owner | |
| type | | ----------------------------
| amount | |
--------------------------------
The transactions
table would store individual records for each incoming and outgoing transaction belonging to all bank and credit card accounts. The second table accounts
would store extra information about each account (and perhaps someday would be used to know which expenses belong to accounts owned by me or my wife).
Create the tables!
You can see more details about the tables and fields by referencing the below code. For a newbie like me, the most exciting part was using DECIMAL(10,2)
for the amount
field, which would accommodate dollar and cent amounts.
CREATE TABLE accounts (
account_id INTEGER PRIMARY KEY
account_name VARCHAR(255) NOT NULL
account_type VARCHAR(50) NOT NULL
institution VARCHAR(50) NOT NULL
owner VARCHAR(50) NULL
);
CREATE TABLE transactions (
transaction_id SERIAL PRIMARY KEY,
transaction_date DATE NOT NULL,
account_id INTEGER REFERENCES accounts(account_id) NOT NULL
description VARCHAR(255) NOT NULL
category VARCHAR(255) NOT NULL
type VARCHAR(255) NOT NULL
amount DECIMAL(10,2) NOT NULL
);
Cleaning data without modifying the data
After creating the tables and importing about 1,500 transactions, I performed a simple query to review the data and get a sense of potential issues to fix.
SELECT
transaction_date,
description,
category,
type,
amount
FROM transactions_demo
ORDER BY transaction_date DESC
LIMIT 20;
transaction_date | description | category | type | amount
---------------- | ------------------------------------------ | ----------------- | ------- | --------
2024-07-31 | Uniqlo USA LLC | Shopping | Sale | -101.08
2024-07-31 | ETERNAL HOME LOANS 0000000000 WEB ID... | NULL | DEBIT | -2348.33
2024-07-31 | GOOGLE *FI DH0TCM | Bills & Utilities | Sale | -103.39
2024-07-30 | Credit Card Payment - Thank You | NULL | Payment | 1972.01
2024-07-30 | PAYMENT TO EMPIRE BANK CARD ENDING IN 0... | NULL | DEBIT | -1972.01
2024-07-30 | TTP* LOVELY DOGGO WALKO | Personal | Sale | -20.80
2024-07-29 | GOOGLE *Peacock | Shopping | Sale | -12.35
2024-07-29 | NEIGHBORS GROCERY OUT | Groceries | Sale | -31.57
2024-07-29 | AMC 0000 ONLINE | Entertainment | Sale | -39.78
2024-07-26 | TAQUERIA EL | Food & Drink | Sale | -24.89
2024-07-25 | SQ *JACK STRANDS SALON | Personal | Sale | -56.64
2024-07-25 | GREYHOUND VETERINARY H | Personal | Sale | -87.19
2024-07-24 | B&H PHOTO 000-000-0000 | Shopping | Sale | -70.58
2024-07-24 | EVIL INTERNET & CABLE CORP | Bills & Utilities | Sale | -77.25
... | ... | ... | ... | ...
As you see, the data required cleaning and standardization before it would be useful for analysis. Here are a few issues:
- The
description
column contained inconsistent formats, extra characters (e.g., strings of text related to payment processing services, random invoice numbers), and sometimes the same business appeared differently. - There were
NULL
values in thecategory
field for all bank transactions, because the bank account didn't automatically assign a category like the credit cards accounts did. - Also, while the
category
field was mostly correct, some transactions required a category different from the one assigned by the credit card. For example, I would rather assign dog food, vet bills, and dog walker fees to a newDog
expense category. - The data in the
type
column wasn't necessarily wrong, but I felt it would benefit from some consolidation. - There were a lot of transactions that would need to be ignored when summarizing spending habits. For example, transfers between my checking account and savings accounts, transfers to investment accounts, regular credit card payments.
Since I was still exploring the data and wasn't sure how I would use them in the future, I wanted to fix the above issues without modifying the data itself.
Consolidating transaction types
The default transaction types were not very intuitive, so I decided to consolidate them into three categories.
Here are the default types.
SELECT DISTINCT(type) FROM transactions_demo;
type
----------
CREDIT
Payment
Adjustment
CHECK
DSLIP
Sale
Return
Fee
DEBIT
I wrote the below CASE
clause to consolidate them into three types: In
, Out
, and Return
. Technically, returned purchases could be grouped with the other incoming transactions, but for now I wanted an easy to separate them -- just case I wanted to filter say salaries and other income but keep returned purchases.
CASE
WHEN type ~ 'Adjustment|CREDIT|DSLIP'
THEN 'In'
WHEN type ~ 'CHECK|DEBIT|Fee|Payment|Sale'
THEN 'Out'
ELSE INITCAP(type) END AS clean_type
Below are the results with the new clean_type
added.
SELECT
description,
type,
CASE
WHEN type ~ 'Adjustment|CREDIT|DSLIP'
THEN 'In'
WHEN type ~ 'CHECK|DEBIT|Fee|Payment|Sale'
THEN 'Out'
ELSE INITCAP(type) END AS clean_type
FROM clean_transactions
ORDER BY transaction_date DESC
LIMIT 20;
description | type | clean_type
------------------------------------ | ------ | ----------
TTP* LOVELY DOGGO WALKO | Sale | Out
BUZZ N WOODY PIZZA | Sale | Out
REGIONAL LANDMARK ENERGY INC | Sale | Out
TST*LUCKY BREWING - CAP | Sale | Out
ROBBER BARON INVESTMENTS PPD ID: ... | DEBIT | Out
TTP* LOVELY DOGGO WALKO | Sale | Out
GREYHOUND VETERINARY H | Sale | Out
NEW TECH NORTHWEST | Sale | Out
REMOTE ONLINE DEPOSIT # Â Â Â | DSLIP | In
TST* DIN TAI FUNG - BELLE | Sale | Out
ETERNAL HOME LOANS Â 0000000000Â Â | DEBIT | Out
GOOGLE *FI DH0TCM | Sale | Out
WALLSTREET BROS SAVINGS: 0000000000 | CREDIT | In
... | ... | ...
Fixing incorrect and NULL categories
Next on the list was cleaning up the NULL
values and wrong categories in the category
field. To get a sense of the categories, here are the distinct values assigned by my credit cards.
SELECT DISTINCT(category)
FROM transactions_demo
ORDER BY category;
category
---------------------
Automotive
Bills & Utilities
Education
Entertainment
Fees & Adjustments
Food & Drink
Gas
Gifts & Donations
Groceries
Health & Wellness
Home
Personal
Professional Services
Shopping
Travel
Identifying records with NULL
values was simple, but figuring out which transactions would need a different category took a little more time. For those records, I prioritized identifying which recurring transactions needed attention. To do so, I used the below query.
Some vendor names in the description
field were not consistent. For example, my mortgage servicing company appeared as ETERNAL HOME LOANS 000...
in some places, but also appeared as ETERNAL HOME LOANS PAYMENTS 000...
in some places. In order to group similar descriptions like these, I used the LEFT
function to only return the first 14 characters of each description.
SELECT
COUNT(*),
LEFT(description,14) AS short_description,
category
FROM transactions_demo
WHERE amount < 0
GROUP BY short_description, category
ORDER BY count DESC;
count | short_description | category
----- | ------------------| --------
83 | TTP* LOVELY DO | Personal
43 | PAYMENT TO EMP | NULL
39 | COSTCO WHSE #0 | Shopping
33 | WALLSTREET BRO | NULL
31 | CHEWY.COM | Shopping
29 | NEIGHBORS GROC | Groceries
27 | APPLE.COM/BILL | Shopping
26 | ROXY'S FRIED C | Food & Drink
25 | BUZZ N WOODY P | Food & Drink
24 | ETERNAL HOME L | NULL
23 | EVIL INTERNET | Bills & Utilities
23 | REGIONAL LANDM | Bills & Utilities
20 | SQ *NEXT DOOR | Food & Drink
19 | COSTCO GAS #00 | Gas
19 | COOKUNITY INC | Groceries
15 | SQ *JACK STRAN | Personal
14 | MCDONALD'S F00 | Food & Drink
14 | GREYHOUND VETE | Personal
14 | INVESTCORP Â BU | NULL
13 | AVG PARKING 00 | Travel
... | ... | ...
After investigating, I compiled a list of issues. Here are a couple:
- As mentioned previously, transactions associated with my bank account were all
NULL
and would need a new category assigned. For example, there were 24 transfers to my mortgage servicing companyETERNAL HOME LOANS
where the category should be changed fromNULL
toMortgage
. - There were 39 transactions to
COSTCO WHSE
assigned toShopping
. I wanted to change these toGroceries
since the vast majority of my purchases at Costco are food and groceries. - Also noted earlier, there were many transactions related to dog ownership, but they were assigned different categories like
Personal
andShopping
. I wanted to replace the automatically assigned categories with a newDog
category. - I wanted to consolidate
Automative
andGas
into one categoryAutomative & Gas
, so it would be easier to group expenses related to owning and driving my car.
Since there wasn't a huge number of recurring vendors and services, I decided the easiest solution would be to use another CASE
clause. For example, if the description contained CHEWY.COM
, GREYHOUND VETERINARY
, or LOVELY DOGGO WALKO
, then the category would be replaced with Dog
.
After reviewing the most frequently recurring descriptions, I identified several patterns to include in the below CASE
statement.
CASE
WHEN description ~ 'DRIVER LICENSE RENEW|VEHICLE LICENSING|INSUROTRON|PARKING|SPOTHERO|PAYBYPHONE' OR category ~ 'Gas|Automotive'
THEN 'Automotive & Gas'
WHEN description ~ '^PAYMENT TO EMPIRE|Credit Card Payment'
THEN 'Credit card'
WHEN description ~ '^REDEMPTION CREDIT'
THEN 'Points cashback'
WHEN description ~ 'LOVELY DOGGO|GREYHOUND VETERINARY|LOCAL HOUND GROOMER|CHEWY.COM|WHO LET THE PETS OUT'
THEN 'Dog'
WHEN description ~ 'PLAYSTATION|NBA League|NETFLIX|YouTube|NVIDIA|APPLE.COM/BILL|GOOGLE \*Peacock'
THEN 'Entertainment'
WHEN description ~ 'COSTCO WHSE|INSTACART'
THEN 'Groceries'
WHEN description ~ 'TRVL CARD BANK|PARKSRESERVATIONSWEB|TRUSTEDTRAVELER'
THEN 'Travel'
WHEN description ~ 'ATM WITHDRAWAL|FOREIGN EXCHANGE RATE|NON-EMPIRE BANK ATM|^CHECK #'
THEN 'Misc'
WHEN description ~ '^ETERNAL HOME LOANS'
THEN 'Mortgage'
WHEN description ~ '^DEPOSIT ID NUMBER|^IRS.*TAX REF'
THEN 'Other income'
WHEN description ~ '^CORPO INC ACH PMT|^VENMO|^Zelle|^REMOTE ONLINE DEPOSIT'
THEN 'Reimbursement'
WHEN description ~ '^WALLSTREET BROS|^INVESTCORP|ROBBER BARON INVESTM|APA.*TREAS|^Online Transfer'
THEN 'Saving & Investing'
WHEN description ~ '^CORPO INC PAYROLL'
THEN 'Wages & Salaries'
ELSE category END AS clean_category
After dropping this into my query from last time...
SELECT
transaction_date,
description,
category,
type,
CASE
WHEN type ~ 'Adjustment|Return'
THEN 'Returns & Cashback'
WHEN type ~ 'CREDIT|DSLIP'
THEN 'Income & Deposits'
WHEN type ~ 'CHECK|DEBIT|Fee|Payment|Sale'
THEN 'Payments & Transfers'
ELSE INITCAP(type) END AS clean_type,
CASE
WHEN description ~ 'DRIVER LICENSE RENEW|VEHICLE LICENSING|INSUROTRON|PARKING|SPOTHERO|PAYBYPHONE' OR category ~ 'Gas|Automotive'
THEN 'Automotive & Gas'
WHEN description ~ '^PAYMENT TO EMPIRE|Credit Card Payment'
THEN 'Credit card'
WHEN description ~ '^REDEMPTION CREDIT'
THEN 'Points cashback'
WHEN description ~ 'LOVELY DOGGO|GREYHOUND VETERINARY|LOCAL HOUND GROOMER|CHEWY.COM|WHO LET THE PETS OUT'
THEN 'Dog'
WHEN description ~ 'PLAYSTATION|NBA League|NETFLIX|YouTube|NVIDIA|APPLE.COM/BILL|GOOGLE \*Peacock'
THEN 'Entertainment'
WHEN description ~ 'COSTCO WHSE|INSTACART'
THEN 'Groceries'
WHEN description ~ 'TRVL CARD BANK|PARKSRESERVATIONSWEB|TRUSTEDTRAVELER'
THEN 'Travel'
WHEN description ~ 'ATM WITHDRAWAL|FOREIGN EXCHANGE RATE|NON-EMPIRE BANK ATM|^CHECK #'
THEN 'Misc'
WHEN description ~ '^ETERNAL HOME LOANS'
THEN 'Mortgage'
WHEN description ~ '^DEPOSIT ID NUMBER|^IRS.*TAX REF'
THEN 'Other income'
WHEN description ~ '^CORPO INC ACH PMT|^VENMO|^Zelle|^REMOTE ONLINE DEPOSIT'
THEN 'Reimbursement'
WHEN description ~ '^WALLSTREET BROS|^INVESTCORP|ROBBER BARON INVESTM|APA.*TREAS|^Online Transfer'
THEN 'Saving & Investing'
WHEN description ~ '^CORPO INC PAYROLL'
THEN 'Wages & Salaries'
ELSE category END AS clean_category,
amount
FROM transactions_demo
WHERE transaction_date <= '2024-08-06'
AND description !~ 'NEW TECH NORTHWEST'
ORDER BY transaction_date DESC
LIMIT 20;
...I got the below example of the clean_category
column.
description | category | clean_category
----------------------------------------- | ----------------- | ---------------
ROBBER BARON INVESTMENTS PPD ID: 00000... | NULL | Saving & Investing
GREYHOUND VETERINARY H | Personal | Dog
TTP* LOVELY DOGGO WALKO | Personal | Dog
REMOTE ONLINE DEPOSIT # Â Â Â Â Â 0 | NULL | Reimbursement
TST* DIN TAI FUNG - BELLE | Food & Drink | Food & Drink
Uniqlo USA LLC | Shopping | Shopping
ETERNAL HOME LOANS Â 0000000000 Â Â Â W... | NULL | Mortgage
GOOGLE *FI DH0TCM | Bills & Utilities | Bills & Utilities
TTP* LOVELY DOGGO WALKO | Personal | Dog
Credit Card Payment - Thank You | NULL | Credit card
PAYMENT TO EMPIRE BANK CARD ENDING IN ... | NULL | Credit card
GOOGLE *Peacock | Shopping | Entertainment
AMC 0000 ONLINE | Entertainment | Entertainment
NEIGHBORS GROCERY OUT | Groceries | Groceries
WALLSTREET BROS SAVINGS: 0000000000 | NULL | Saving & Investing
TAQUERIA EL | Food & Drink | Food & Drink
GREYHOUND VETERINARY H | Personal | Dog
SQ *JACK STRANDS SALON | Personal | Personal
FANDANGO | Entertainment | Entertainment
TRVL CARD BANK Â ONLINE PMT CKF00000000... | NULL | Travel
Removing chaos from the descriptions
The description
field includes a note about each transaction. For example, it could be the vendor from which I purchased something or the name of a savings account to which I sent money.
The descriptions had numerous formatting inconsistencies and included extra characters that made it difficult to group them together. For example, each transaction related to Google Fi (my phone service) included a unique invoice code. Take a look below.
SELECT
transaction_date,
description
FROM transactions_demo
WHERE description ~* '^Google.+Fi'
ORDER BY transaction_date DESC
LIMIT 25;
transaction_date | description
---------------- | -----------------
2024-07-31 | GOOGLE *FI DH0TCM
2024-07-01 | GOOGLE *FI M0zWRQ
2024-05-31 | GOOGLE *FI 00CNX0
2024-05-01 | GOOGLE *FI 0GB00X
2024-03-31 | GOOGLE *FI 0S0HNC
2024-03-02 | GOOGLE *FI XKW00N
2024-01-31 | GOOGLE *FI D0J0RZ
2023-12-31 | GOOGLE *FI G0JRTL
2023-12-01 | GOOGLE *FI L0CQ00
2023-10-31 | GOOGLE *FI 0HFBR0
2023-10-01 | GOOGLE *FI vvRRSB
2023-08-31 | GOOGLE *FI CV00MN
2023-07-31 | GOOGLE *FI 00LRXC
... | ...
If I wanted to answer a question like "How much did I spend on phone service last year," standardizing the description to something like Google Fi
would make it easier to group the transactions.
Next, this is mainly an aesthetic problem, but take a look at dining and takeout transactions.
SELECT
description,
category
FROM transactions_demo
WHERE category = 'Food & Drink'
ORDER BY transaction_date DESC
LIMIT 20;
'description' | 'category'
------------------------- | ------------
MESOPOTAMIA MIX | Food & Drink
TST*LUCKY BREWING - CAP | Food & Drink
HOLE IN WALL THAI LLC | Food & Drink
TST*ROCKET TACO - TOWN | Food & Drink
CAFE BRIO | Food & Drink
SQ *RED FISH BLUE FISH | Food & Drink
SQ *CRAFT (VICTORIA) LTD. | Food & Drink
FRIENDS & FAMILY BAKE | Food & Drink
BUZZ N WOODY PIZZA | Food & Drink
TST* DIN TAI FUNG | Food & Drink
TAQUERIA EL | Food & Drink
TWSP PRK ADAM GOLF RES0 | Food & Drink
SQ *FATS WINGS AND WAFFLE | Food & Drink
SQ *BEANFISH | Food & Drink
SQ *FATS WINGS AND WAFFLE | Food & Drink
TST* RETURNER CINEMA | Food & Drink
... | ...
Restaurants and many other businesses use payment processing services like Square or Toast, which insert a short string like SQ *
and TST*
at the beginning of each description. I wanted to remove these extra strings (and other characters) to enhance readability of the data. Like this:
-
SQ *RED FISH BLUE FISH
-->RED FISH BLUE FISH
-
SQ *CRAFT (VICTORIA) LTD.
-->CRAFT VICTORIA
-
TST*LUCKY BREWING - CAP
-->LUCKY BREWING - CAP
Next I checked descriptions that included a few special characters:
SELECT
description,
category
FROM transactions_demo
WHERE description ~ '\*|\#|\-'
ORDER BY transaction_date DESC
LIMIT 25;
description | category
---------------------------------- | -----------------
00000 - 000 DEXTER JUNO | Travel
COSTCO WHSE #0000 | Shopping
REMOTE ONLINE DEPOSIT # 000000 | NULL
GOOGLE *Peacock | Shopping
B\&H PHOTO 000-000-0000 | Shopping
AMAZON MKTPL*RJ0DG00F0 | Shopping
LOWES #00000* | Home
COSTCO GAS #0000 | Gas
Based on that query, I wanted to remove extra characters (and numbers) like these:
-
COSTCO WHSE #0000
-->COSTCO WHSE
-
COSTCO GAS #0000
-->COSTCO GAS
-
LOWES #00000*
-->LOWES
And fix special characters like the ampersand in this description:
-
B&H PHOTO
-->B&H PHOTO
Finally, there were many transactions that included a mix of letters, numbers, and special characters at the end of the string. Look at the utter mess below.
SELECT description
FROM transactions_demo
WHERE description ~* '\d+'
ORDER BY transaction_date DESC
LIMIT 25;
description
--------------------------------------------------------------
ROBBER BARON INVESTMENTS PPD ID: 0000000000
ETERNAL HOME LOANS 0000000000 WEB ID: 0000000000
PAYMENT TO EMPIRE BANK CARD ENDING IN 0000 00/00
WALLSTREET BROS SAVINGS: 0000000000
TRVL CARD BANK ONLINE PMT CKF000000000POS WEB ID: 0000000000
APA TREAS 000 MISC PAY PPD ID: 0000000000
APA TREAS 000 MISC PAY PPD ID: 0000000000
...
...
In these situations, I wanted to change something like ETERNAL HOME LOANS 0000000000 WEB ID: 0000000000
to just ETERNAL HOME LOANS
.
Regular expressions to the rescue
By exploring the description column and reviewing the most frequently occurring records, I identified several patterns that should be deleted or fixed. From there, I wrote a a series of nested REPLACE
and REGEXP_REPLACE
clauses that would clean up the descriptions.
-- Change first letter to uppercase, rest of string lower case
TRIM(INITCAP(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REPLACE(
REPLACE(description,'''',''),
-- Fix ampersands (&)
'&', '&'),
-- Replace strings of letters and numbers after a special character (as long as it appears after the eighth char)
'(?<!^.{0,8})\W\d+.+$',' ''g'),
-- Replace certain patterns (e.g., 'TST*'), special characters, unwanted strings(e.g. 'LLC')
'^SP\s|^[a-zA-Z]{2,3}\*|^[a-zA-Z]{2,3} \*|\s[a-zA-Z]{1,3}\d+.+$|\s[a-zA-Z]{3}\s[a-zA-Z]{2}\:\s.|\d|[^\w\s&.]|\sLLC$|\sINC$|\sLTD$|\sLLC\.$|\sINC\.$|\sInc\.$|\sLTD\.$', ' ', 'g'),
-- Replace single characters hanging at the end of a string (e.g. 'Veterinary H' --> 'Veterinary')
'\s[a-zA-Z]$|\s[a-zA-Z]\s+$','', 'g'),
-- Change multiple spaces into one space
'\s+', ' ', 'g')
)) AS clean_description
I assume someone with more experience would be able to write a more concise version, but I figured it was good enough for me. Let's take a look...
SELECT
transaction_date,
description,
TRIM(INITCAP(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REPLACE(
REPLACE(description,'''',''),
'&', '&'),
'(?<!^.{0,8})\W\d+.+$',' ''g'),
'^SP\s|^[a-zA-Z]{2,3}\*|^[a-zA-Z]{2,3} \*|\s[a-zA-Z]{1,3}\d+.+$|\s[a-zA-Z]{3}\s[a-zA-Z]{2}\:\s.|\d|[^\w\s&.]|\sLLC$|\sINC$|\sLTD$|\sLLC\.$|\sINC\.$|\sInc\.$|\sLTD\.$', ' ', 'g'),
'\s[a-zA-Z]$|\s[a-zA-Z]\s+$','', 'g'),
'\s+', ' ', 'g')
)) AS clean_description
FROM transactions_demo
ORDER BY transaction_date DESC
LIMIT 45;
'description' | 'clean_description'
---------------------------------------- | -------------------------------------
TTP* LOVELY DOGGO WALKO | Lovely Doggo Walko
TST*LUCKY BREWING - CAP | Lucky Brewing Cap
00000 - 000 DEXTER JUNO | Dexter Juno
TTP* LOVELY DOGGO WALKO | Lovely Doggo Walko
EVIL INTERNET & CABLE CORP | Evil Internet & Cable Corp
HOLE IN WALL THAI LLC | Hole In Wall Thai
COSTCO WHSE #0000 | Costco Whse
COAST VICTORIA HOTE & | Coast Victoria Hote &
PY *FRS CLIPPER | Frs Clipper
CHEWY.COM | Chewy.Com
SQ *CRAFT (VICTORIA) LTD. | Craft Victoria
IRISH TIMES PUB | Irish Times Pub
FRIENDS & FAMILY BAKE | Friends & Family Bake
SQ *RED FISH BLUE FISH | Red Fish Blue Fish
ROBBER BARON INVESTMENTS PPD ID: 0000... | Robber Baron Investments
TTP* LOVELY DOGGO WALKO | Lovely Doggo Walko
GREYHOUND VETERINARY H | Greyhound Veterinary
REMOTE ONLINE DEPOSIT # 0 | Remote Online Deposit
ETERNAL HOME LOANS 0000000 WEB ID: ... | Eternal Home Loans
GOOGLE *FI DH0TCM | Google Fi
Uniqlo USA LLC | Uniqlo Usa
TTP* LOVELY DOGGO WALKO | Lovely Doggo Walko
PAYMENT TO EMPIRE BANK CARD ENDING IN... | Payment To Empire Bank Card Ending In
... | ...
...And it did a pretty GOOD job!!!!
However, it wasn't perfect. For example, let's look at purchases from Amazon.
description | clean_description
---------------------- | ------------------
AMAZON MKTPL*RJ0DG00F0 | Amazon Mktpl Rj Dg
AMZN Mktp US*R00000RQ0 | Amzn Mktp Us R Rq
AMAZON MKTPL*000GV0A00 | Amazon Mktpl
AMZN Mktp US*BZ0E00T00 | Amzn Mktp Us Bz E
Amazon.com | Amazon.Com
AMZN Mktp US*DK00L00F0 | Amzn Mktp Us Dk L
Amazon.com*R00Y00R00 | Amazon.Com R Y
...
For recurring vendors and services, I decided it would be faster to add some specialized rules to the CASE
statement. Back when I was fixing the category field earlier, I already made a list of frequently recurring vendors and patterns in the description
field, so it was pretty easy to create another clause as below.
TRIM(CASE
WHEN description ~ 'PAYBYPHONE'
THEN 'PayByPhone Parking'
WHEN description ~* 'GOOGLE.*FI'
THEN 'Google Fi'
WHEN description ~ 'LOVELY DOGGO WALKO'
THEN 'Lovely Doggo Walko'
WHEN description ~* 'PLAYSTATION'
THEN 'PlayStation Network'
WHEN description ~ 'SOUTHERN CHICKEN AND WAFF'
THEN 'Southern Chicken And Waffle'
WHEN description ~ 'NEIGHBORS GROCERY OUT'
THEN 'Neighbors Grocery Outlet'
WHEN description ~ 'AmazonStores|Amazon Fresh'
THEN 'Amazon Fresh'
WHEN description ~ 'WALLSTREET BROS'
THEN 'Wallstreet Bros Savings'
WHEN description ~* 'AMZN Mktp|Amazon.com|AMAZON MKTPL'
THEN 'Amazon.com'
WHEN description ~ 'ATM WITHDRAWAL'
THEN 'ATM Withdrawal'
WHEN description ~ 'ETERNAL HOME LOANS'
THEN 'Eternal Home Loans'
WHEN description ~ 'AIRBNB'
THEN 'Airbnb'
WHEN description ~ 'LYFT.*RIDE'
THEN 'Lyft Ride'
WHEN description ~* 'Uniqlo'
THEN 'Uniqlo'
ELSE INITCAP(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REPLACE(
REPLACE(description,'''',''),
'&', '&'),
'(?<!^.{0,8})\W\d+.+$',' ''g'),
'^SP\s|^[a-zA-Z]{2,3}\*|^[a-zA-Z]{2,3} \*|\s[a-zA-Z]{1,3}\d+.+$|\s[a-zA-Z]{3}\s[a-zA-Z]{2}\:\s.|\d|[^\w\s&.]|\sLLC$|\sINC$|\sLTD$|\sLLC\.$|\sINC\.$|\sInc\.$|\sLTD\.$', ' ', 'g'),
'\s[a-zA-Z]$|\s[a-zA-Z]\s+$','', 'g'),
'\s+', ' ', 'g')
) END) AS clean_description
Move the query to a common table expression
At this point, the new clean_description
, clean_category
, and clean_type
columns fixed most of the inconsistent and missing data. It was time to start summarizing and analyzing my expenses! But first... let's examine one more important step.
When I tried to filter transactions using a WHERE
clause with the clean_category
field, as you can see below...
SELECT
transaction_date,
description,
CASE
WHEN description ~ 'DRIVER LICENSE RENEW|VEHICLE LICENSING|INSUROTRON|PARKING|SPOTHERO|PAYBYPHONE' OR category ~ 'Gas|Automotive'
THEN 'Automotive & Gas'
WHEN description ~ '^PAYMENT TO EMPIRE|Credit Card Payment'
THEN 'Credit card'
WHEN description ~ '^REDEMPTION CREDIT'
THEN 'Points cashback'
WHEN description ~ 'LOVELY DOGGO|GREYHOUND VETERINARY|LOCAL HOUND GROOMER|CHEWY.COM|WHO LET THE PETS OUT'
THEN 'Dog'
WHEN description ~ 'PLAYSTATION|NBA League|NETFLIX|YouTube|NVIDIA|APPLE.COM/BILL|GOOGLE \*Peacock'
THEN 'Entertainment'
WHEN description ~ 'PHO SUP SHOP|BLING BLING ASIAN MARKET'
THEN 'Food & Drink'
WHEN description ~ 'COSTCO WHSE|INSTACART'
THEN 'Groceries'
WHEN description ~ 'TRVL CARD BANK|PARKSRESERVATIONSWEB|TRUSTEDTRAVELER'
THEN 'Travel'
WHEN description ~ 'ATM WITHDRAWAL|FOREIGN EXCHANGE RATE|NON-EMPIRE BANK ATM|^CHECK #'
THEN 'Misc'
WHEN description ~ '^ETERNAL HOME LOANS'
THEN 'Mortgage'
WHEN description ~ '^DEPOSIT ID NUMBER|^IRS.*TAX REF'
THEN 'Other income'
WHEN description ~ '^CORPO INC ACH PMT|^VENMO|^Zelle|^REMOTE ONLINE DEPOSIT'
THEN 'Reimbursement'
WHEN description ~ '^WALLSTREET BROS|^INVESTCORP|ROBBER BARON INVESTM|APA.*TREAS|^Online Transfer'
THEN 'Saving & Investing'
WHEN description ~ '^CORPO INC PAYROLL'
THEN 'Wages & Salaries'
ELSE category END AS clean_category,
amount
FROM transactions_demo
-- Only show results from the Food & Drink category
WHERE clean_category = 'Food & Drink'
...I encountered this error.
ERROR: column "clean_category" does not exist
LINE 98: WHERE clean_category = 'Food & Drink'
^
Why? Well, when resolving a query, SQL checks the content in the WHERE
clause before resolving the SELECT
portion of the statement (makes sense because we only want data matching the WHERE
clause to be processed and returned). Since the CASE
clause is processed along with the rest of the SELECT
section, it doesn't exist when WHERE
is resolved. So my clean_category
column didn't exist when the WHERE
clause was resolved.
To fix this issue, I moved the query to a CTE (common table expression). That way, my "clean" columns would be stored in something like a temporary table, and I would be able to reference the cleaned data when writing a WHERE
clause. Here is an example of the query after moving it to a CTE.
WITH
-- Common table expression (CTE) containing my cleaned data
clean_transactions AS (
SELECT
transaction_id,
transaction_date,
-- Replace common descriptions and remove extra characters and symbols
TRIM(CASE
WHEN description ~ 'PAYBYPHONE'
THEN 'PayByPhone Parking'
WHEN description ~* 'GOOGLE.*FI'
THEN 'Google Fi'
WHEN description ~ 'LOVELY DOGGO WALKO'
THEN 'Lovely Doggo Walko'
WHEN description ~* 'PLAYSTATION'
THEN 'PlayStation Network'
WHEN description ~ 'SOUTHERN CHICKEN AND WAFF'
THEN 'Southern Chicken And Waffle'
WHEN description ~ 'NEIGHBORS GROCERY OUT'
THEN 'Neighbors Grocery Outlet'
WHEN description ~ 'AmazonStores|Amazon Fresh'
THEN 'Amazon Fresh'
WHEN description ~ 'WALLSTREET BROS'
THEN 'Wallstreet Bros Savings'
WHEN description ~* 'AMZN Mktp|Amazon.com|AMAZON MKTPL'
THEN 'Amazon.com'
WHEN description ~ 'ATM WITHDRAWAL'
THEN 'ATM Withdrawal'
WHEN description ~ 'ETERNAL HOME LOANS'
THEN 'Eternal Home Loans'
WHEN description ~ 'AIRBNB'
THEN 'Airbnb'
WHEN description ~ 'LYFT.*RIDE'
THEN 'Lyft Ride'
WHEN description ~* 'Uniqlo'
THEN 'Uniqlo'
ELSE INITCAP(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REPLACE(
REPLACE(description,'''',''),
'&', '&'),
-- Replace strings of letters and numbers after a special character (as long as it appears after the eighth char)
'(?<!^.{0,8})\W\d+.+$',' ''g'),
-- Replace certain patterns (e.g., 'TST*'), special characters, unwanted strings(e.g. 'LLC')
'^SP\s|^[a-zA-Z]{2,3}\*|^[a-zA-Z]{2,3} \*|\s[a-zA-Z]{1,3}\d+.+$|\s[a-zA-Z]{3}\s[a-zA-Z]{2}\:\s.|\d|[^\w\s&.]|\sLLC$|\sINC$|\sLTD$|\sLLC\.$|\sINC\.$|\sInc\.$|\sLTD\.$', ' ', 'g'),
-- Replace single characters hanging at the end of a string (e.g. 'Veterinary H' --> 'Veterinary')
'\s[a-zA-Z]$|\s[a-zA-Z]\s+$','', 'g'),
-- Change multiple spaces into one space
'\s+', ' ', 'g')
) END) AS clean_description,
-- Fix wrong categories and add new ones (e.g., Dog, Mortgage, Wages & Salary)
CASE
WHEN description ~ 'DRIVER LICENSE RENEW|VEHICLE LICENSING|INSUROTRON|PARKING|SPOTHERO|PAYBYPHONE' OR category ~ 'Gas|Automotive'
THEN 'Automotive & Gas'
WHEN description ~ '^PAYMENT TO EMPIRE|Credit Card Payment'
THEN 'Credit card'
WHEN description ~ '^REDEMPTION CREDIT'
THEN 'Points cashback'
WHEN description ~ 'LOVELY DOGGO|GREYHOUND VETERINARY|LOCAL HOUND GROOMER|CHEWY.COM|WHO LET THE PETS OUT'
THEN 'Dog'
WHEN description ~ 'PLAYSTATION|NBA League|NETFLIX|YouTube|NVIDIA|APPLE.COM/BILL|GOOGLE \*Peacock'
THEN 'Entertainment'
WHEN description ~ 'COSTCO WHSE|INSTACART'
THEN 'Groceries'
WHEN description ~ 'TRVL CARD BANK|PARKSRESERVATIONSWEB|TRUSTEDTRAVELER'
THEN 'Travel'
WHEN description ~ 'ATM WITHDRAWAL|FOREIGN EXCHANGE RATE|NON-EMPIRE BANK ATM|^CHECK #'
THEN 'Misc'
WHEN description ~ '^ETERNAL HOME LOANS'
THEN 'Mortgage'
WHEN description ~ '^DEPOSIT ID NUMBER|^IRS.*TAX REF'
THEN 'Other income'
WHEN description ~ '^CORPO INC ACH PMT|^VENMO|^Zelle|^REMOTE ONLINE DEPOSIT'
THEN 'Reimbursement'
WHEN description ~ '^WALLSTREET BROS|^INVESTCORP|ROBBER BARON INVESTM|APA.*TREAS|^Online Transfer'
THEN 'Saving & Investing'
WHEN description ~ '^CORPO INC PAYROLL'
THEN 'Wages & Salaries'
ELSE category END AS clean_category,
-- Group transaction type to simplify grouping later
CASE
WHEN type ~ 'Adjustment|Return'
THEN 'Returns & Cashback'
WHEN type ~ 'CREDIT|DSLIP'
THEN 'Income & Deposits'
WHEN type ~ 'CHECK|DEBIT|Fee|Payment|Sale'
THEN 'Payments & Transfers'
ELSE INITCAP(type) END AS clean_type,
amount,
-- Add year, quarter, and month for summarizing transactions later
EXTRACT(YEAR FROM transaction_date) AS year,
TO_CHAR(transaction_date,'YYYY"_Q"Q') as qtr_string,
TO_CHAR(transaction_date,'YYYY_MM') as month,
account_id
FROM transactions_demo
)
-- Query referencing the new 'clean_transactions' CTE
SELECT
transaction_date,
clean_description,
amount
FROM clean_transactions
WHERE clean_category = 'Food & Drink'
ORDER BY transaction_date DESC;
transaction_date | clean_description | amount
---------------- | --------------------- | -------
2024-08-21 | Buzz N Woody Pizza | -50.33
2024-08-17 | Lucky Brewing Cap | -7.42
2024-08-17 | Mesopotamia Mix | -29.06
2024-08-13 | Hole In Wall Thai | -52.50
2024-08-12 | Rocket Taco Town | -33.59
2024-08-11 | Coast Victoria Hote & | -42.62
2024-08-10 | Cafe Brio | -121.86
2024-08-09 | Friends & Family Bake | -7.90
2024-08-09 | Red Fish Blue Fish | -32.86
2024-08-09 | Irish Times Pub | -74.75
2024-08-09 | Craft Victoria | -31.78
2024-08-03 | Din Tai Fung Belle | -40.10
2024-07-26 | Taqueria El | -24.89
... | ... | ...
Yes!!! With the clean_transactions
CTE, I would be able to finally start summarizing and analyzing the transactions.
A couple quick queries
To start, let's look at the total spend per category along with the percentage of total spend that year.
WITH
...
...
...
SELECT
year,
clean_category,
SUM(amount) AS total_spend,
ROUND(SUM(amount) / (SUM(SUM(amount)) OVER(PARTITION BY year))*100,2) AS pct_of_spend
FROM clean_transactions
WHERE clean_category NOT IN ('Wages & Salaries','Credit card','Saving & Investing','Other income','Points cashback')
GROUP BY year, clean_category
ORDER BY year DESC, total_spend;
year | clean_category | total_spend | pct_of_spend
---- | --------------------- | ----------- | ------------
2024 | Mortgage | -16196.71 | 38.78
2024 | Shopping | -5309.93 | 12.71
2024 | Travel | -4350.83 | 10.42
2024 | Dog | -3255.04 | 7.79
2024 | Bills & Utilities | -3208.90 | 7.68
2024 | Food & Drink | -3021.83 | 7.24
2024 | Groceries | -1830.94 | 4.38
2024 | Automotive & Gas | -1446.29 | 3.46
2024 | Misc | -760.00 | 1.82
2024 | Home | -715.45 | 1.71
2024 | Entertainment | -687.91 | 1.65
2024 | Personal | -543.92 | 1.30
2024 | Professional Services | -393.02 | 0.94
2024 | Health & Wellness | -170.78 | 0.41
2024 | Gifts & Donations | -76.00 | 0.18
... | ... | ... | ...
Or, just for fun, I could check my most frequented restaurants in 2024.
WITH
...
...
...
SELECT
COUNT(clean_description) AS visits,
clean_description AS restaurant
FROM clean_transactions
WHERE clean_category IN('Food & Drink')
AND year = '2024'
GROUP BY clean_description
ORDER BY visits DESC;
visits | restaurant
------ | ---------------------------
11 | Roxys Fried Chicken
6 | Buzz N Woody Pizza
4 | Mesopotamia Mix
4 | Mcdonalds
3 | Southern Chicken And Waffle
2 | Fantasy Donut Force
2 | Taqueria El
2 | Twsp Prk Adam Golf Res
2 | Taco Tuesday
2 | Next Door Bakery
... | ...
...So, yes, there might be a fried chicken problem in this household.
Thank you for reading!
I hope my experience has provided ideas for your own personal coding or expense tracking projects. In my next post, I will explore my "essential" and "non-essential" expenses, find large purchases that exceed the average transaction amount, and summarize quarterly spending. In the meantime, if you have any questions or suggestions please let me know.
If you made it this far, I have a question for you: "What kinds of personal projects did you do when you first started learning SQL?"
Posted on September 19, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.