Implementing months_between() function in BigQuery
Niko Draganić
Posted on January 19, 2021
After 6 years of BI experience in Oracle DB, I came to Bornfight to develop their BI system using Google Cloud Platform. Having used very little of it, I was glad to see all the major features, such as aggregate and analytic functions, waiting for me. I am still adjusting to not having full procedural language (PL) capabilities, but that may be for the better, since building Agile BI can be a very long lasting process using only code.
My first assignment was to create a report for Employee Loyalty Bonuses, which are calculated based solely on the duration of one's employment; a 10 minute task using Oracle's months_between()
function. So I immediately looked up if BigQuery had any similar functions and I found the built-in date_diff()
, which calculates the difference between two dates in days, months, or years. So far so good. I tested it out on some examples and realized it give the literal difference in the year/month part of the date, ie. Jan 1st to Feb 28th returns 1, as does Jan 31st to Feb 1st. What I wanted was, of course, something very close to 2 in the first case and something very close to 0 in the second one. So I got to thinking.
(skip to the end for the final code)
The Logic
The function needs to take two arguments: DAT_TO and DAT_FROM (in that particular order because of legacy and intuition of subtracting the earlier/older date from the later/newer one).
So what are the rules?
1. If day_in_month
is the same, the function returns an integer
The first one is obvious: Jan 1st to Feb 1st should be 1 month, as should Feb 2nd to Mar 2nd, because month is a relative term of variable duration. Same goes for every pair of dates with the same value in their day fields. This is also where date_diff()
does the job correctly. However, not all the dates in my dataset will be on the same day of the month.
2. If DAT_TO.day_in_month
is EOM and DAT_FROM.day_in_month
is EOM, the function returns an integer.
Consider the following: Jan 31st to Mar 31st returns 2, as per rule #1. So what should the function return if DAT_FROM were Feb 28th (in a non-leap year)? It should be 1, because even though day_in_month
values are not the same, they are both ends of their respective months.
3. Monotony
So, in respect to Feb 28 as the end-of-month, we have Mar 28th returning 1 (rule #1) and Mar 31st returning 1 (rule #2). What about Mar 29th and 30th? Well, it would be rather weird for this to equal 1 on the 28th, then be greater than 1 on the 29th, then be even greater on the 30th and then return to 1 on 31st of March. We need this function to be monotonically increasing, which means that going further in the future either increases or holds the value of the function, never decreases it. Combined with #2, it leaves us with this:
If DAT_TO.day_in_month
< DAT_FROM.day_in_month
and DAT_TO
is EOM then return an integer.
4. Inverse of rules 2 and 3
The logic applied in #2 and #3 goes both ways. Without a detailed explanation, If DAT_FROM.day_in_month
< DAT_TO.day_in_month
and DAT_FROM
is EOM then return an integer.
5. Non-integer differences
So, rules #1-#4 give us the cases when the result is an integer. What about everything else? The dates inside the same month are easy: Distance between Jan 1st and Jan 15th is fourteen days, over 31 days in total for January, which gives us 0.45 of a month. But what if the dates are not in the same month? Consider Feb 14th to Mar 15th. The value should be slightly over 1, because rule #1 gives us exactly 1, and this is one day longer. But how do we proceed? We have half of February ((28-14)/28) to go and slightly less than half of March (15/31) to get from Feb 14th to Mar 15th. But adding those two fractions does not give us a value over 1, in fact, it's less than 1. That's why we should always denominate by the total number of days in the month of DAT_FROM
. This gives us (14+15)/28, which is > 1. And it works for all pairs of dates, because as we get to the end of the month, we are covered by rules #3 and #4.
6. Negative distance for reversed dates
For legacy reasons (to be aligned with Oracle function, that is), when DAT_FROM
> DAT_TO
, the returned value should be negative.
The Code
As with every SQL query, there are multiple ways to get this result. Here's how I did it:
1. Extract year, month and day digits
Simply have the day, month and year values as integers.
SELECT
calculation_date AS calculation_date,
CAST(FORMAT_DATETIME("%Y", DATETIME(calculation_date)) AS INT64) AS year_num,
CAST(FORMAT_DATETIME("%m", DATETIME(calculation_date)) AS INT64) AS month_num,
CAST(FORMAT_DATETIME("%d", DATETIME(calculation_date)) AS INT64) AS day_num,
FORMAT_DATETIME("%Y%m", DATETIME(calculation_date)) AS ym
FROM (
SELECT DAT_FROM AS calculation_date UNION ALL
SELECT DAT_TO AS calculation_date
)
2. Calculate total days in month and distance to BOM and EOM
Calculate percentages, or rather, percentiles of month passed and month left.
SELECT
calculation_date,
year_num,
month_num,
day_num,
ym,
days_in_month,
round(distance_to_bom/days_in_month, 4) AS pct_month_passed,
round(distance_to_eom/days_in_month, 4) AS pct_month_left
FROM (
SELECT
*,
date_diff(last_day(calculation_date), date_trunc(calculation_date, MONTH), DAY) + 1 AS days_in_month,
date_diff(calculation_date, date_trunc(calculation_date, MONTH), DAY) AS distance_to_bom,
date_diff(last_day(calculation_date), calculation_date, DAY) AS distance_to_eom
FROM (
--#1
)
)
3. Pivot
Since we started by having two rows, we want all values in a single row, so we pivot by maximizing for both DAT_FROM and DAT_TO input values.
SELECT
max(CASE WHEN calculation_date = DAT_TO THEN calculation_date END) AS calc_dat_to,
max(CASE WHEN calculation_date = DAT_TO THEN year_num END) AS year_num_to,
max(CASE WHEN calculation_date = DAT_TO THEN month_num END) AS month_num_to,
max(CASE WHEN calculation_date = DAT_TO THEN day_num END) AS day_num_to,
max(CASE WHEN calculation_date = DAT_TO THEN days_in_month END) AS days_in_month_to,
max(CASE WHEN calculation_date = DAT_TO THEN pct_month_passed END) AS pct_month_passed_to,
max(CASE WHEN calculation_date = DAT_TO THEN pct_month_left END) AS pct_month_left_to,
max(CASE WHEN calculation_date = DAT_FROM THEN calculation_date END) AS calc_dat_from,
max(CASE WHEN calculation_date = DAT_FROM THEN year_num END) AS year_num_from,
max(CASE WHEN calculation_date = DAT_FROM THEN month_num END) AS month_num_from,
max(CASE WHEN calculation_date = DAT_FROM THEN day_num END) AS day_num_from,
max(CASE WHEN calculation_date = DAT_FROM THEN days_in_month END) AS days_in_month_from,
max(CASE WHEN calculation_date = DAT_FROM THEN pct_month_passed END) AS pct_month_passed_from,
max(CASE WHEN calculation_date = DAT_FROM THEN pct_month_left END) AS pct_month_left_from
FROM (
--#2
)
sql
4. Implement the logic
This just implements the logic stated above.
SELECT
CASE
WHEN calc_dat_from = calc_dat_to
THEN 0
WHEN calc_dat_from < calc_dat_to
THEN CASE
WHEN day_num_to = day_num_from
THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
WHEN day_num_to < day_num_from AND pct_month_left_to = 0
THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
WHEN day_num_to > day_num_from AND pct_month_left_from = 0
THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
ELSE pct_month_left_from + (date_diff(calc_dat_to, calc_dat_from, MONTH) - 1) + (day_num_to/days_in_month_from)
END
WHEN calc_dat_from > calc_dat_to
THEN CASE
WHEN day_num_from = day_num_to
THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
WHEN day_num_from < day_num_to AND pct_month_left_from = 0
THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
WHEN day_num_from > day_num_to AND pct_month_left_to = 0
THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
ELSE -1*( pct_month_left_to + (date_diff(calc_dat_from, calc_dat_to, MONTH) - 1) + (day_num_from/days_in_month_to) )
END
END AS result
FROM (
--#3
)
Final code
Here's the entire code for the function for those who want results without the background story.
CREATE OR REPLACE FUNCTION sys.months_between(
DAT_TO DATE,
DAT_FROM DATE
) RETURN FLOAT64
AS (
SELECT
CASE
WHEN calc_dat_from = calc_dat_to
THEN 0
WHEN calc_dat_from < calc_dat_to
THEN CASE
WHEN day_num_to = day_num_from
THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
WHEN day_num_to < day_num_from AND pct_month_left_to = 0
THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
WHEN day_num_to > day_num_from AND pct_month_left_from = 0
THEN (year_num_to-year_num_from)*12 + month_num_to - month_num_from
ELSE pct_month_left_from + (date_diff(calc_dat_to, calc_dat_from, MONTH) - 1) + (day_num_to/days_in_month_from)
END
WHEN calc_dat_from > calc_dat_to
THEN CASE
WHEN day_num_from = day_num_to
THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
WHEN day_num_from < day_num_to AND pct_month_left_from = 0
THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
WHEN day_num_from > day_num_to AND pct_month_left_to = 0
THEN -1*( (year_num_from-year_num_to)*12 - month_num_from - month_num_to )
ELSE -1*( pct_month_left_to + (date_diff(calc_dat_from, calc_dat_to, MONTH) - 1) + (day_num_from/days_in_month_to) )
END
END AS result
FROM (
SELECT
max(CASE WHEN calculation_date = DAT_TO THEN calculation_date END) AS calc_dat_to,
max(CASE WHEN calculation_date = DAT_TO THEN year_num END) AS year_num_to,
max(CASE WHEN calculation_date = DAT_TO THEN month_num END) AS month_num_to,
max(CASE WHEN calculation_date = DAT_TO THEN day_num END) AS day_num_to,
max(CASE WHEN calculation_date = DAT_TO THEN days_in_month END) AS days_in_month_to,
max(CASE WHEN calculation_date = DAT_TO THEN pct_month_passed END) AS pct_month_passed_to,
max(CASE WHEN calculation_date = DAT_TO THEN pct_month_left END) AS pct_month_left_to,
max(CASE WHEN calculation_date = DAT_FROM THEN calculation_date END) AS calc_dat_from,
max(CASE WHEN calculation_date = DAT_FROM THEN year_num END) AS year_num_from,
max(CASE WHEN calculation_date = DAT_FROM THEN month_num END) AS month_num_from,
max(CASE WHEN calculation_date = DAT_FROM THEN day_num END) AS day_num_from,
max(CASE WHEN calculation_date = DAT_FROM THEN days_in_month END) AS days_in_month_from,
max(CASE WHEN calculation_date = DAT_FROM THEN pct_month_passed END) AS pct_month_passed_from,
max(CASE WHEN calculation_date = DAT_FROM THEN pct_month_left END) AS pct_month_left_from
FROM (
SELECT
calculation_date,
year_num,
month_num,
day_num,
ym,
days_in_month,
round(distance_to_bom/days_in_month, 4) AS pct_month_passed,
round(distance_to_eom/days_in_month, 4) AS pct_month_left
FROM (
SELECT
*,
date_diff(last_day(calculation_date), date_trunc(calculation_date, MONTH), DAY) + 1 AS days_in_month,
date_diff(calculation_date, date_trunc(calculation_date, MONTH), DAY) AS distance_to_bom,
date_diff(last_day(calculation_date), calculation_date, DAY) AS distance_to_eom
FROM (
SELECT
calculation_date AS calculation_date,
CAST(FORMAT_DATETIME("%Y", DATETIME(calculation_date)) AS INT64) AS year_num,
CAST(FORMAT_DATETIME("%m", DATETIME(calculation_date)) AS INT64) AS month_num,
CAST(FORMAT_DATETIME("%d", DATETIME(calculation_date)) AS INT64) AS day_num,
FORMAT_DATETIME("%Y%m", DATETIME(calculation_date)) AS ym
FROM (
SELECT DAT_FROM AS calculation_date UNION ALL
SELECT DAT_TO AS calculation_date
)
)
)
)
)
)
Enjoy!
Posted on January 19, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.