Cross-Validation for Time Series
Abzal Seitkaziyev
Posted on February 15, 2021
To estimate the performance of a machine learning model, we may use cross-validation (CV), which creates multiple (say, n) train-test splits and trains and evaluates n models, one per split.
Scikit-learn provides k-fold CV, which splits the data into k train-test groups under the assumption that observations are independent. In time series, however, observations depend on one another, so applying plain k-fold CV can leak future information into the training set (target leakage) and inflate the performance estimate.
For Time Series data I explored the following cross-validation techniques:
1) Scikit-learn's Time Series Split.
Here we use an expanding window for the train set and a fixed-size window for the test set.
Example of Indices Split:
TRAIN: [0 1 2 3 4 5] TEST: [6 7]
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]
TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11]
...
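The split above can be reproduced directly with scikit-learn's TimeSeriesSplit; the sketch below assumes scikit-learn 0.24 or later, where the test_size parameter is available:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 dummy observations

# Expanding train window, fixed test window of 2 observations
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
for train_idx, test_idx in tscv.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

With 12 observations, 3 splits, and a test size of 2, this prints exactly the indices shown above: the train set grows while each test set covers the next two points.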
2) Blocking Time Series Split.
Here we train and test on disjoint, non-overlapping blocks of data, so no observation is reused across splits.
Example of the split:
TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [10 11]
TRAIN: [12 13 14 15 16 17 18 19 20 21] TEST: [22 23]
TRAIN: [24 25 26 27 28 29 30 31 32 33] TEST: [34 35]
...
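Scikit-learn has no built-in blocking splitter, so here is a minimal sketch of one; the function name and its parameters (train_size, test_size) are illustrative, not part of any library API:

```python
import numpy as np

def blocking_time_series_split(n_samples, n_splits, train_size, test_size):
    """Yield disjoint (train, test) index blocks, one block per split."""
    block = train_size + test_size
    indices = np.arange(n_samples)
    for i in range(n_splits):
        start = i * block
        if start + block > n_samples:
            break  # not enough data left for a full block
        yield (indices[start:start + train_size],
               indices[start + train_size:start + block])

for train_idx, test_idx in blocking_time_series_split(36, 3, 10, 2):
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

With 36 observations, 3 splits, a train size of 10, and a test size of 2, this yields the blocks shown above.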
3) Walk Forward Validation.
a) We use a fixed-size (sliding) window for the train data and the single next observation for the test.
Example of Indices Split:
TRAIN: [0 1 2 3 4 5] TEST: [6]
TRAIN: [1 2 3 4 5 6] TEST: [7]
TRAIN: [2 3 4 5 6 7] TEST: [8]
....
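The sliding-window variant is simple to write by hand; the helper below is an illustrative sketch, not a library function:

```python
import numpy as np

def walk_forward_sliding(n_samples, window):
    """Fixed-size training window that slides forward one step at a time,
    testing on the single observation immediately after the window."""
    for start in range(n_samples - window):
        train = np.arange(start, start + window)
        test = np.array([start + window])
        yield train, test

for train_idx, test_idx in walk_forward_sliding(9, 6):
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

With 9 observations and a window of 6, this reproduces the three splits listed above.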
b) We use an expanding window for the train data and the single next observation for the test, which is a special case of scikit-learn's Time Series Split with a test window of size one.
Example of Indices Split:
TRAIN: [0 1 2 3 4 5] TEST: [6]
TRAIN: [0 1 2 3 4 5 6] TEST: [7]
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8]
....
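The expanding-window variant can be obtained from TimeSeriesSplit by setting the test window to a single observation; as before, this sketch assumes scikit-learn 0.24+ for the test_size parameter:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(9).reshape(-1, 1)  # 9 dummy observations

# Expanding train window, one-step-ahead test
tscv = TimeSeriesSplit(n_splits=3, test_size=1)
for train_idx, test_idx in tscv.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

With 9 observations and 3 splits this matches the indices above: the train set grows by one point per split and each test set is the single next observation.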