Forecasting of periodic events with ML

Anna Pastushko

Posted on July 7, 2022

Periodic event forecasting is quite useful if you are, for example, a data aggregator. Data aggregators, or data providers, are organizations that collect statistical, financial, or other data from different sources, transform it, and then offer it for further analysis and exploration (data as a service).

Data as a Service (DaaS)

It is really important for such organizations to monitor release dates so that they can gather data as soon as it is published and plan capacity to handle the incoming volumes of data.
Sometimes the authorities that publish data provide a schedule of future releases, and sometimes they don't. In some cases they announce the schedule only for the next one or two months, so you may want to build the publication schedule yourself and predict release dates.
For the majority of statistical releases, you can find a pattern tied to the day of the week or month. For example, statistics can be released:

  • every last working day of the month,
  • every third Tuesday of the month,
  • every second-to-last working day of the month, etc.

With this in mind and the history of previous release dates, we want to predict the date, or range of dates, when the next data release might happen.
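For illustration, here is a minimal sketch of how two such rules could be checked in code (the helper names are mine, not part of any library):

import numpy as np
import pandas as pd

def is_last_working_day(d: pd.Timestamp) -> bool:
    # Last calendar day of d's month, rolled back to the nearest business day.
    month_end = (d + pd.offsets.MonthEnd(0)).date()
    last_bday = np.busday_offset(month_end, 0, roll='backward')
    return np.datetime64(d.date()) == last_bday

def is_third_tuesday(d: pd.Timestamp) -> bool:
    # weekday() == 1 is Tuesday; (day - 1) // 7 counts full weeks before d.
    return d.weekday() == 1 and (d.day - 1) // 7 == 2

print(is_last_working_day(pd.Timestamp('2020-06-30')))  # True
print(is_third_tuesday(pd.Timestamp('2020-06-16')))     # True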

Case Study

As a case study, let's take the U.S. Conference Board (CB) Consumer Confidence Indicator. It is a leading indicator that measures the level of consumer confidence in economic activity. Using it, we can predict consumer spending, which plays a major role in overall economic activity.

The official data provider does not publish a schedule for this series, but many data aggregators like Investing.com have been collecting the data for a while, and the series' release history is available there.

Goal: predict the date(s) of the next release(s).

Data preparation

We start by importing all the packages for data manipulation, machine learning, and other data transformations.

# Data manipulation
import pandas as pd
import numpy as np

# Manipulation with dates
from datetime import date
from dateutil.relativedelta import relativedelta

# Machine learning
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

The next step is to get the release history. You may have a database with all the data and the history of release dates that you can use. To keep things simple and focus on release date prediction, I will add the history to a DataFrame manually.

data = pd.DataFrame({'Date': ['2021-01-26','2020-12-22',
                     '2020-11-24','2020-10-27','2020-09-29',
                     '2020-08-25','2020-07-28','2020-06-30',
                     '2020-05-26','2020-04-28','2020-03-31',
                     '2020-02-25','2020-01-28','2019-12-31',
                     '2019-11-26','2019-10-29','2019-09-24',
                     '2019-08-27','2019-07-30','2019-06-25',
                     '2019-05-28']})

We should also add a column with 0 and 1 values to specify whether a release happened on a given date. For now, we only have the dates of releases, so we create a column filled with 1s.

data['Date'] = pd.to_datetime(data['Date'])
data['Release'] = 1

After that, we need to add rows for the dates between releases to the DataFrame and fill the Release column with zeros for them.

r = pd.date_range(start=data['Date'].min(), end=data['Date'].max())
data = (data.set_index('Date').reindex(r).fillna(0.0)
            .rename_axis('Date').reset_index())

Now the dataset is ready for further manipulations.

Feature engineering

Prediction of the next release dates relies heavily on feature engineering, because we do not have any features besides the release date itself. Therefore, we will create the following features:

  • month
  • a calendar day of the month
  • working day number
  • day of the week
  • week of month number
  • monthly weekday occurrence (e.g., the second Wednesday of the month)

data['Month'] = data['Date'].dt.month
data['Day'] = data['Date'].dt.day
# Number of working days elapsed since the start of the month
data['Workday_N'] = np.busday_count(
                    data['Date'].values.astype('datetime64[M]'),
                    data['Date'].values.astype('datetime64[D]'))
data['Week_day'] = data['Date'].dt.weekday
# Calendar week of the month the date falls into
data['Week_of_month'] = (data['Date'].dt.day
                         - data['Date'].dt.weekday - 2) // 7 + 2
# Occurrence of this weekday within the month (1st, 2nd, ... Wednesday)
data['Weekday_order'] = (data['Date'].dt.day + 6) // 7
data = data.set_index('Date')

Training machine learning models

First, we need to split our dataset into two parts: train and test. Don't forget to set the shuffle argument to False, because our goal is to create a forecast based on past events.

x_train, x_test, y_train, y_test = train_test_split(data.drop(['Release'], axis=1), data['Release'],
                 test_size=0.3, random_state=1, shuffle=False)

In general, shuffling helps reduce overfitting by mixing up the training observations. But that is not our case: the model should always be trained on the full, chronologically ordered history of publication events.
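If you also want cross-validation that respects time order, scikit-learn's TimeSeriesSplit can be used: each fold trains on an initial segment of the history and validates on the dates that immediately follow it. A minimal sketch:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(data.drop(['Release'], axis=1)):
    # The model never sees dates later than its validation fold.
    print(f'train: {len(train_idx)} days, validate: {len(test_idx)} days')

You can also pass tscv as the cv argument to GridSearchCV below to keep the folds chronological.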

In order to choose the best prediction model, we will test the following models:

  • XGBoost
  • K-nearest Neighbors (KNN)
  • RandomForest

XGBoost

We will use XGBoost with tree base learners and grid search to choose the best parameters. Grid search tries every possible combination of the given parameter values and picks the best one based on cross-validation.

A drawback of this approach is long computation time.

Alternatively, random search can be used. It samples parameter values at random from the given ranges and, after a set number of iterations, chooses the best model.

However, when you have a large number of parameters, random search tests only a small share of the possible combinations, which makes finding a *really* optimal combination almost impossible.
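For reference, a random search over similar parameters could look like this with scikit-learn's RandomizedSearchCV (the parameter lists below are illustrative):

from sklearn.model_selection import RandomizedSearchCV

random_param = {"learning_rate": [0.01, 0.05, 0.1],
                "n_estimators": [100, 150, 200],
                "alpha": [0.1, 0.5, 1],
                "max_depth": [2, 3, 4]}
# n_iter controls how many random combinations are evaluated.
random_search = RandomizedSearchCV(estimator=xgb.XGBRegressor(),
                                   param_distributions=random_param,
                                   n_iter=10,
                                   scoring="neg_mean_squared_error",
                                   cv=4, random_state=1)
random_search.fit(x_train, y_train)
print("Best parameters found: ", random_search.best_params_)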

To use grid search, you need to specify the list of possible values for each parameter.

grid_param = {"learning_rate": [0.01, 0.1],
              "n_estimators": [100, 150, 200],
              "alpha": [0.1, 0.5, 1],
              "max_depth": [2, 3, 4]}
model = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=model, param_grid=grid_param,
                        scoring="neg_mean_squared_error",
                        cv=4, verbose=1)
grid_mse.fit(x_train, y_train)
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

As you can see, the best parameters for our XGBoost model are: alpha = 0.5, n_estimators = 200, max_depth = 4, learning_rate = 0.1.

Let's train the model with the obtained parameters.

xgb_model = xgb.XGBClassifier(objective='binary:logistic',
                              colsample_bytree=1,
                              learning_rate=0.1,
                              max_depth=4,
                              alpha=0.5,
                              n_estimators=200)
xgb_model.fit(x_train, y_train)
xgb_prediction = xgb_model.predict(x_test)

K-nearest Neighbors (KNN)

The K-nearest neighbors model is meant for cases where you are trying to find similarities between observations. This is exactly our case, because we are looking for patterns in past release dates.

The KNN algorithm has fewer parameters to tune, so it is simpler for those who have not used it before.

knn = KNeighborsClassifier(n_neighbors = 3, algorithm = 'auto',
                           weights = 'distance')
knn.fit(x_train, y_train)
knn_prediction = knn.predict(x_test)

Random Forest

Tuning the basic parameters of a random forest usually doesn't take much time: you iterate over the possible numbers of estimators and maximum tree depths and choose the optimal values using the elbow method, as sketched below.
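A minimal sketch of that search (the candidate values are illustrative): train a model for each number of estimators, record the test accuracy, and pick the smallest value after which the score stops improving.

for n in [10, 25, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n, max_depth=10, random_state=1)
    rf.fit(x_train, y_train)
    # The score typically flattens out after the elbow point.
    print(n, rf.score(x_test, y_test))

With the values chosen, we train the final model: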

random_forest = RandomForestClassifier(n_estimators=50,
                                       max_depth=10, random_state=1)
random_forest.fit(x_train, y_train)
rf_prediction = random_forest.predict(x_test)

Comparing the results

We will use a confusion matrix to evaluate the performance of the trained models. It lets us compare the models side by side and understand whether the parameters should be tuned any further.

xgb_matrix = metrics.confusion_matrix(y_test, xgb_prediction)
print(f"""
Confusion matrix for XGBoost model:
TN:{xgb_matrix[0][0]}    FP:{xgb_matrix[0][1]}
FN:{xgb_matrix[1][0]}    TP:{xgb_matrix[1][1]}""")

knn_matrix = metrics.confusion_matrix(y_test, knn_prediction)
print(f"""
Confusion matrix for KNN model:
TN:{knn_matrix[0][0]}    FP:{knn_matrix[0][1]}
FN:{knn_matrix[1][0]}    TP:{knn_matrix[1][1]}""")

rf_matrix = metrics.confusion_matrix(y_test, rf_prediction)
print(f"""
Confusion matrix for Random Forest model:
TN:{rf_matrix[0][0]}    FP:{rf_matrix[0][1]}
FN:{rf_matrix[1][0]}    TP:{rf_matrix[1][1]}""")

As you can see, both XGBoost and Random Forest show good performance. They both caught the pattern and predicted the dates correctly in most cases. However, both models made a mistake with the December 2020 release, because it breaks the release pattern.

KNN is less accurate than the previous two: it failed to predict three dates correctly and missed 5 releases. At this point, we do not proceed with KNN. In general, it works better when the data is normalized, so you can try to tune it if you want.
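If you do, here is a minimal sketch of KNN with normalized features, using a scikit-learn Pipeline so the scaler is fit only on the training data:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

knn_scaled = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=3,
                                                weights='distance'))
knn_scaled.fit(x_train, y_train)
knn_scaled_prediction = knn_scaled.predict(x_test)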

Of the remaining two, XGBoost is overcomplicated for our goal in terms of hyperparameter tuning, so Random Forest is our choice.

Now we need to create a DataFrame with future dates and use the trained Random Forest model to predict releases one year ahead.

x_predict = pd.DataFrame(pd.date_range(date.today(), (date.today() +
            relativedelta(years=1)), freq='d'), columns=['Date'])
x_predict['Month'] = x_predict['Date'].dt.month
x_predict['Day'] = x_predict['Date'].dt.day
x_predict['Workday_N'] = np.busday_count(
                x_predict['Date'].values.astype('datetime64[M]'),
                x_predict['Date'].values.astype('datetime64[D]'))
x_predict['Week_day'] = x_predict['Date'].dt.weekday
x_predict['Week_of_month'] = (x_predict['Date'].dt.day -
                              x_predict['Date'].dt.weekday - 2) // 7 + 2
x_predict['Weekday_order'] = (x_predict['Date'].dt.day + 6) // 7
x_predict = x_predict.set_index('Date')

prediction = random_forest.predict(x_predict)

That's it: we created a forecast of release dates for the U.S. CB Consumer Confidence series one year ahead.
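To turn the 0/1 predictions into a readable schedule, keep only the dates where the model predicts a release; a minimal sketch:

# Dates the model marks as releases over the next year
predicted_releases = x_predict.index[prediction == 1]
print(predicted_releases.strftime('%Y-%m-%d').tolist())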

Conclusion

If you want to predict future dates for periodic events, think carefully about which features to create: they should capture all the patterns you can find in the history. As you can see, we did not spend much time on model tuning; even simple models can give good results if you use the right features.

Thank you for reading till the end. I hope it was helpful; please let me know in the comments if you spot any mistakes.
