XGBoost for Time Series Forecasting: Avoiding Data Leakage and Finding a Solution

Hey there, fellow machine learning enthusiasts! I’ve been diving into XGBoost for time series forecasting and have run into a common issue that I’d love to discuss with you all. If you’ve used XGBoost this way, you may have encountered it too.

The issue concerns the test set when using XGBoost for time series forecasting. In various articles, I’ve seen people use a sliding-window approach to create feature and target variables. For example, a window $(t_1, t_2, \dots, t_n, t_{n+1}, \dots, t_{n+m})$ is split so that the first $n$ values serve as features and the last $m$ values as targets. These rows are then fed into XGBoost to learn the relationship between the features and the targets.
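To make that concrete, here is a minimal sketch of how such a windowed dataset might be built with NumPy. The helper name `make_windows` and the window sizes are my own placeholders, not taken from any particular article:

```python
import numpy as np

def make_windows(series, n, m):
    """Slice a 1-D series into (feature, target) rows: each row takes
    n consecutive values as features and the next m values as targets."""
    X, y = [], []
    for i in range(len(series) - n - m + 1):
        X.append(series[i : i + n])          # lag features t_1..t_n
        y.append(series[i + n : i + n + m])  # targets t_{n+1}..t_{n+m}
    return np.array(X), np.array(y)

# Example: 100 observations, 10 lags as features, 3-step-ahead targets
series = np.arange(100, dtype=float)
X, y = make_windows(series, n=10, m=3)
print(X.shape, y.shape)  # (88, 10) (88, 3)
```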

The problem arises when we predict the future $m$ points. During the testing phase, these articles appear to use the actual feature values. When predicting the first $m$ future points, that is fine: the actual $n$ points immediately before them are available as features. But when we move on to point $m+1$ and beyond, at least one of the $n$ lag features no longer has an actual value; it falls inside the horizon we just forecast.

This raises the question: do these methods suffer from data leakage? Or is it safe to assume that the actual values of those $n$ features are known when we focus on the next $m$ data points?
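For reference, here is a minimal sketch of the evaluation setup I mean, reusing the hypothetical `make_windows` helper above. The key property is that the split over the window rows is strictly chronological, since shuffling the rows is what would mix future information into training:

```python
import xgboost as xgb

# Strictly chronological split over the window rows: every training
# row precedes every test row in time. Shuffling the rows here would
# be the real leakage risk.
cutoff = int(0.8 * len(X))
X_train, y_train = X[:cutoff], y[:cutoff]
X_test, y_test = X[cutoff:], y[cutoff:]

# An XGBoost regressor predicts a single output, so fit one model for
# the first step ahead (multi-step strategies are sketched below).
model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train[:, 0])
preds = model.predict(X_test)
```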

My current idea is that we can either feed the predicted values back in as lag features for the next round of $m$-point forecasts, or train $L$ independent regressors that forecast the $L$ future points in one batch; both approaches are sketched below.
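Both ideas have standard names in the forecasting literature: feeding predictions back in is the recursive (autoregressive) strategy, and training one regressor per horizon step is the direct strategy. Here is a rough sketch of each, reusing `np`, `xgb`, and the training arrays from the snippets above; treat it as an illustration under those assumptions, not a reference implementation:

```python
def recursive_forecast(model, last_window, steps):
    """Recursive strategy: a single one-step model whose predictions
    are fed back as lag features. Errors can compound with the horizon."""
    window = list(last_window)
    preds = []
    for _ in range(steps):
        x = np.array(window[-len(last_window):]).reshape(1, -1)
        yhat = float(model.predict(x)[0])
        preds.append(yhat)
        window.append(yhat)  # the prediction becomes a lag feature
    return np.array(preds)

def direct_forecast(X_train, Y_train, last_window):
    """Direct strategy: one independent regressor per horizon step,
    each trained to predict step h directly from the original lags."""
    preds = []
    for h in range(Y_train.shape[1]):
        model_h = xgb.XGBRegressor(n_estimators=200, max_depth=4)
        model_h.fit(X_train, Y_train[:, h])  # target is step h ahead
        preds.append(float(model_h.predict(last_window.reshape(1, -1))[0]))
    return np.array(preds)
```

The trade-off, roughly: the recursive strategy needs only one model but compounds its own errors over the horizon, while the direct strategy avoids that feedback at the cost of training $L$ models that cannot share information across horizon steps.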

What do you think, fellow machine learners? Have you encountered this issue before, and how did you tackle it? Let’s discuss!
