Cross-Validation for Time Series

Photo by Markus Winkler on Unsplash

To estimate how well a machine learning model performs, we can use cross-validation (CV): the data is split into multiple (say n) train-test pairs, and a model is trained and tested on each pair, which gives n performance estimates.

Scikit-learn provides k-fold CV, which splits the data into k folds and uses each fold once as the test set while training on the remaining k-1 folds. It assumes that observations are independent. In a time series, however, observations depend on earlier ones, so k-fold CV can train on data that comes after the test period and leak target information into the performance estimate.
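
To make the leakage concrete, here is a minimal sketch of what plain k-fold does to time-ordered data. The 12-observation toy series and the variable names are assumptions made for illustration; only `KFold` itself comes from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy series: 12 observations, already ordered in time.
X = np.arange(12).reshape(-1, 1)

# Plain k-fold without shuffling: each fold is a contiguous chunk.
kf = KFold(n_splits=3)
kfold_splits = list(kf.split(X))

for train_idx, test_idx in kfold_splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

In the first fold the model is trained on observations 4 to 11 and tested on observations 0 to 3, so it is fitted on the future of the period it is evaluated on.
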
For time series data, I explored the following cross-validation techniques:

1) Scikit-learn's Time Series Split (TimeSeriesSplit).
Here we use an expanding window for the train set and a fixed-size window for the test set.

Example of Indices Split:
TRAIN: [0 1 2 3 4 5] TEST: [6 7]
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]
TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11]
...
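
The split above can be reproduced with a short sketch like the one below. The series length (12), `n_splits=3` and `test_size=2` are assumptions chosen to match the example indices; the `test_size` parameter requires scikit-learn 0.24 or newer.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy series with 12 time-ordered observations.
X = np.arange(12).reshape(-1, 1)

# Expanding train window, fixed test window of 2 observations.
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
expanding_splits = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X)]

for train_idx, test_idx in expanding_splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Every train set starts at index 0 and ends right before its test window, so the model never sees observations later than the ones it is tested on.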

2) Blocking Time Series Split.
Here each split trains and tests on its own block of data, and the blocks do not overlap.
Example of Indices Split:

TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11]
TRAIN: [12 13 14 15 16 17 18 19 20 21] TEST: [22 23]
TRAIN: [24 25 26 27 28 29 30 31 32 33] TEST: [34 35]
...
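
Scikit-learn does not ship a splitter for this scheme, so the sketch below uses a small hand-written generator. The function name `blocking_time_series_split` and its parameters are hypothetical, and the sizes (36 observations, blocks of 12, last 2 of each block for testing) are assumptions chosen to match the example indices.

```python
import numpy as np

def blocking_time_series_split(n_samples, block_size, test_size):
    """Yield (train, test) index arrays from consecutive, non-overlapping blocks.

    Hypothetical helper: the last `test_size` observations of each block
    form the test set, the rest of the block forms the train set.
    """
    if not 0 < test_size < block_size:
        raise ValueError("test_size must be between 1 and block_size - 1")
    indices = np.arange(n_samples)
    # Step through the series one full block at a time; a trailing
    # partial block (if any) is ignored.
    for start in range(0, n_samples - block_size + 1, block_size):
        stop = start + block_size
        split_point = stop - test_size
        yield indices[start:split_point], indices[split_point:stop]

block_splits = [
    (tr.tolist(), te.tolist())
    for tr, te in blocking_time_series_split(n_samples=36, block_size=12, test_size=2)
]

for train_idx, test_idx in block_splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Because no observation is shared between blocks, each split is evaluated on data that no other split has trained on.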

3) Walk Forward Validation.

a) We use a fixed-size (sliding) window for the train data and the next single observation for the test.

Example of Indices Split:
TRAIN: [0 1 2 3 4 5] TEST: [6]
TRAIN: [1 2 3 4 5 6] TEST: [7]
TRAIN: [2 3 4 5 6 7] TEST: [8]
...
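
One way to get this sliding window is to cap the train size in scikit-learn's `TimeSeriesSplit`, as in the sketch below. The series length (9), `n_splits=3` and `max_train_size=6` are assumptions chosen to match the example indices; `test_size` requires scikit-learn 0.24 or newer.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy series with 9 time-ordered observations.
X = np.arange(9).reshape(-1, 1)

# max_train_size keeps only the 6 most recent observations for training,
# test_size=1 tests on the single next observation.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=6, test_size=1)
sliding_splits = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X)]

for train_idx, test_idx in sliding_splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

The train window keeps a constant length and moves forward by one observation per split, dropping the oldest point each time.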

b) We use an expanding window for the train data and the next single observation for the test. This is a variation of scikit-learn's Time Series Split.

Example of Indices Split:
TRAIN: [0 1 2 3 4 5] TEST: [6]
TRAIN: [0 1 2 3 4 5 6] TEST: [7]
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8]
...
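
As a rough sketch, the expanding variant can be obtained from `TimeSeriesSplit` with `test_size=1`, and the splitter can then be passed to `cross_val_score` to get one error per step. The synthetic linear data, the `LinearRegression` model and the MAE metric are assumptions picked only for illustration, not part of the original experiments.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Hypothetical data: 9 time steps with a perfectly linear target.
X = np.arange(9, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# Expanding train window, one observation ahead for the test.
tscv = TimeSeriesSplit(n_splits=3, test_size=1)
walk_forward_splits = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X)]

for train_idx, test_idx in walk_forward_splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)

# One model is fitted per split; scikit-learn returns negated MAE,
# so flip the sign to get an error per walk-forward step.
scores = cross_val_score(
    LinearRegression(), X, y, cv=tscv, scoring="neg_mean_absolute_error"
)
mae_per_step = -scores
print("MAE per step:", mae_per_step)
```

With a single observation in each test set, R² is not well defined, so an error metric such as MAE is a more sensible choice for this scheme.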


Abzal Seitkaziyev
