mwang-cmn
Posted on August 17, 2024
Introduction
Data Bank, a cutting-edge digital bank, is pioneering the integration of banking and distributed data storage. By linking customers' cloud storage limits to their account balances, Data Bank offers a unique blend of financial services and secure data storage. As the digital banking landscape evolves, understanding customer behavior and forecasting data needs are crucial for strategic growth and efficient resource allocation.
Problem Statement
This project aims to explore customer nodes and transaction patterns within the Data Bank system to identify key metrics that will help the management team optimize customer allocation, improve transaction tracking, and accurately forecast future data storage requirements. View the data challenge here
Data Structure and SQL Environment
The dataset used in this analysis was intergrated into a SQL Server database. The database structure is designed with clarity featuring three key tables:
1. Regions:
2. Customer Nodes
3. Customer Transactions
The Entity Relationship Diagram for this dataset is as follows:
Data Cleaning
The dataset has been thouroughly examined and all three tables are clean and suited for analysis. There are no null values, duplicates or incorrect data types.
Question and Answers
- How many unique nodes are there on the Data Bank system?
SELECT COUNT(DISTINCT node_id) as unique_nodes
FROM dbo.customer_nodes;
- What is the number of nodes per region?
SELECT r.region_name, COUNT(c.node_id) as nodes
FROM customer_nodes c
INNER JOIN regions r
ON c.region_id = r.region_id
GROUP BY r.region_name;
How many customers are allocated to each region?
Counts the unique number of customers from each Region, in descending order. Australia has the largest number of customers
SELECT r.region_name, COUNT(DISTINCT c.customer_id) as customer_count
FROM customer_nodes c
INNER JOIN regions r
ON c.region_id = r.region_id
GROUP BY r.region_name
ORDER BY customer_count DESC;
How many days on average are customers reallocated to a different node?
SELECT AVG(DATEDIFF(DAY, start_date, end_date)) AS avg_days
FROM customer_nodes
WHERE end_date != '9999-12-31';
What is the median, 80th and 95th percentile for this same reallocation days metric for each region?
The CTE date_diff calculates the number of days each customer spends on a node before being reallocated, linking this data to specific regions. The query then computes the median (50th percentile), 80th percentile, and 95th percentile of these reallocation days for each region
WITH date_diff AS(
SELECT c.customer_id,
r.region_name,
r.region_id,
DATEDIFF(DAY, start_date, end_date) AS reallocation_days
FROM customer_nodes c
INNER JOIN regions r
ON c.region_id = r.region_id
WHERE end_date != '9999-12-31'
)
SELECT DISTINCT region_id,
region_name,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY reallocation_days) OVER (PARTITION BY region_name) AS median_days,
PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY reallocation_days) OVER (PARTITION BY region_name) AS eighty_perc_days,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY reallocation_days) OVER (PARTITION BY region_name) AS ninety_five_perc_days
FROM date_diff
ORDER BY region_id, region_name;
What is the unique count and total amount for each transaction type?
SELECT txn_type AS transaction_type,
COUNT(*) AS unique_transactions,
SUM(txn_amount) AS total_amount
FROM customer_transactions
GROUP BY txn_type
ORDER BY unique_transactions DESC, total_amount DESC;
There were 4,777 unique transactions during the period, 56% of which were deposits. Additionally, the total amount transacted during the period was $2,958,708.
What is the average total historical deposit counts and amounts for all customers?
The CTE customer_deposits calculates the total number of deposits and the total amount of money deposited for each customer in the customer_transactions table. Itfilters the transactions to include only those where the transaction type is 'deposit', then grouping the results by each customer_id.
WITH customer_deposits AS (
SELECT customer_id,
COUNT(txn_type) as deposit_count,
SUM(txn_amount) as deposit_amount
FROM customer_transactions
WHERE txn_type='deposit'
GROUP BY customer_id
)
SELECT AVG(deposit_count) AS avg_count,
AVG(deposit_amount) AS avg_amount
FROM customer_deposits;
For each month - how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month?
The CTE customer_trends tracks customer transaction behavior on a monthly basis. It calculates, for each month and each customr id:
- The number of deposits (deposit_count)
- The number of purchases (purchase_count)
- The number of withdrawals (withdrawal_count) The main query then determines, for each month, how many unique customers, made more than 1 deposit (deposit_count > 1).Additionally, made either at least 1 purchase or at least 1 withdrawal (purchase_count > 0 OR withdrawal_count > 0) in the same month.
WITH customer_trends AS
(
SELECT customer_id,
DATEPART(MONTH, txn_date) AS month_id,
DATENAME(MONTH, txn_date) AS month_name,
COUNT(CASE WHEN txn_type = 'deposit' THEN 1 END) AS deposit_count,
COUNT(CASE WHEN txn_type = 'purchase' THEN 1 END) AS purchase_count,
COUNT(CASE WHEN txn_type='withdrawal' THEN 1 END) AS withdrawal_count
FROM customer_transactions
GROUP BY customer_id, DATEPART(MONTH, txn_date),DATENAME(MONTH, txn_date)
)
SELECT month_id,
month_name,
COUNT(DISTINCT customer_id) AS total_customers
FROM customer_trends
WHERE deposit_count>1 AND (purchase_count>0 OR withdrawal_count>0)
GROUP BY month_id, month_name
ORDER BY month_id, month_name;
This analysis identifies customers who are more actively engaged by tracking those who not only deposit money multiple times within a month but also use the account for other types of transactions like purchases or withdrawals.
What is the closing balance for each customer at the end of the month?
WITH cashflows AS (
SELECT
customer_id,
DATEPART(MONTH, txn_date) AS month_id, -- Use month number for proper ordering
DATENAME(MONTH, txn_date) AS month_name,
SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END) AS inflow
FROM
customer_transactions
GROUP BY
customer_id, DATEPART(MONTH, txn_date), DATENAME(MONTH, txn_date)
)
SELECT
customer_id,
month_name,
SUM(inflow) OVER (
PARTITION BY customer_id
ORDER BY month_id
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS closing_balance
FROM
cashflows
ORDER BY
customer_id, month_id;
To calculate the closing balance for each customer as of the maximum transaction date (txn_date) within each month;
WITH cashflows AS (
SELECT
customer_id,
txn_date,
SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END) AS inflow
FROM
customer_transactions
GROUP BY
customer_id, txn_date
),
latest_cashflows AS (
SELECT
customer_id,
txn_date,
SUM(inflow) OVER (PARTITION BY customer_id ORDER BY txn_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS closing_balance,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY txn_date DESC) AS rn
FROM
cashflows
)
SELECT
customer_id,
txn_date,
closing_balance
FROM
latest_cashflows
WHERE
rn = 1
ORDER BY
customer_id;
cashflows CTE:
This CTE calculates the net inflow for each customer_id and each txn_date. The result includes the transaction date (txn_date) along with the month information.latest_cashflows CTE:
In this step, the ROW_NUMBER() function is used to assign a rank (rn) to each row within the partition of customer_id and month_id, ordered by txn_date in descending order. This ensures that the latest transaction date within each month is given the rank of 1.Final Query:
The final SELECT statement retrieves the closing balance (inflow) for the latest transaction date in each month by filtering for rn = 1.
What is the percentage of customers who increase their closing balance by more than 5%?
What is the percentage of customers who increase their closing balance by more than 5%?
- The cashflows CTE calculates the net cash flow for each customer per month by summing up the transaction amounts, treating deposits as positive inflows and other transactions as outflows.
- ClosingBalance CTE:calculates the running (cumulative) closing balance for each customer over time by summing the monthly inflows up to and including the current month.
- PercentChange CTE: This calculates the percentage change in the closing balance for each customer from one month to the next, using the LAG function to compare the closing balance of the current month with the previous month.
WITH cashflows AS (
SELECT customer_id,
DATEPART(YEAR, txn_date) * 12 + DATEPART(MONTH, txn_date) AS month_id,
SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -txn_amount END) AS inflow
FROM
customer_transactions
GROUP BY
customer_id, DATEPART(YEAR, txn_date) * 12 + DATEPART(MONTH, txn_date)
),
ClosingBalance AS
(
SELECT customer_id,
month_id,
SUM(inflow) OVER (PARTITION BY customer_id ORDER BY month_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS closing_balance
FROM cashflows
),
PercentChange AS
(
SELECT customer_id,
month_id,
closing_balance,
100 * (closing_balance - LAG(closing_balance) OVER (PARTITION BY customer_id ORDER BY month_id)) / NULLIF(LAG(closing_balance) OVER (PARTITION BY customer_id ORDER BY month_id), 0) AS percent_increase
FROM ClosingBalance
)
SELECT 100 * COUNT(DISTINCT customer_id) / CAST((SELECT COUNT(DISTINCT customer_id) FROM customer_transactions) AS float) AS percent_customers
FROM PercentChange
WHERE percent_increase > 5;
Full sql script - Github
Posted on August 17, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.