What is the best practice for splitting time-series data for a regression task in a pandas DataFrame, with or without CV, in Python?

Let's say I have time-series data in a pandas DataFrame, including a timestamp and a few variables (columns/features). I generated a sample time-series dataset as below:

import pandas as pd
import numpy as np
import datetime as dt

#Generate data for 20 days
x1 = np.arange(1, 21) + 0.3 * (np.random.random(size=(20,)) - 0.5)
x2 = np.arange(1, 21) + 0.2 * (np.random.random(size=(20,)) - 0.5)
start = dt.datetime.strptime("1 Nov 01", "%d %b %y")
daterange = pd.date_range(start, periods=20)
table = {"X1": x1, "X2": x2, "date": daterange}
df = pd.DataFrame(table)
df.set_index("date", inplace=True)
df

I checked other posts like this or this. sklearn has TimeSeriesSplit(), but based on this answer, if the data is already sorted by timestamp, it is also possible to use train_test_split(shuffle=False) directly on a pandas DataFrame:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3, shuffle=False) #without CV
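
To make the unshuffled holdout concrete, here is a minimal sketch, assuming the frame is already sorted by its date index. The frame below only mirrors the sample data above, and the random seed is an assumption added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild a 20-day frame like the sample above (the seed is an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "X1": np.arange(1, 21) + 0.3 * (rng.random(20) - 0.5),
        "X2": np.arange(1, 21) + 0.2 * (rng.random(20) - 0.5),
    },
    index=pd.date_range("2001-11-01", periods=20, name="date"),
)

# shuffle=False keeps the row order, so the test set is the tail of the frame.
train, test = train_test_split(df, test_size=0.3, shuffle=False)

# Every training timestamp should precede every test timestamp.
print(train.index.max(), "<", test.index.min())
```

Because the rows are never reordered, both pieces stay DataFrames with their original date index, which is only meaningful if the index was sorted beforehand.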

I looked for a way to use TimeSeriesSplit with a pandas DataFrame and came across this post, but its approach to splitting with CV was a bit vague to me because of the (n_years - 1) in their scenario:

tscv = TimeSeriesSplit(n_splits=len(df['year'].unique()) - 1)
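
For reference, a minimal sketch of how TimeSeriesSplit might be applied to a DataFrame. By default it cuts the rows into n_splits + 1 consecutive blocks and uses each block after the first as a test fold, which may be why that post used one fewer split than the number of years. The frame and n_splits=4 below are illustrative assumptions, not taken from that post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical 20-day frame sorted by date (a stand-in for the sample df).
df = pd.DataFrame(
    {"X1": np.arange(1.0, 21.0), "X2": np.arange(1.0, 21.0) * 2},
    index=pd.date_range("2001-11-01", periods=20, name="date"),
)

tscv = TimeSeriesSplit(n_splits=4)  # n_splits=4 is an arbitrary choice

folds = []
for train_idx, test_idx in tscv.split(df):
    # split() yields positional indices, so iloc is used to slice the frame.
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    folds.append((train, test))
    print(train.index[-1].date(), "->", test.index[0].date(), len(train), len(test))
```

Each training window grows while the test fold always lies strictly after it, so no future rows leak into training.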

So far, the best approach I have found follows this answer:

#without CV
time_sample = '2001-11-12'

#get_indexer replaces get_loc(..., method='pad'), which pandas 2.0 removed
time_stamp_index = df.index.get_indexer([pd.Timestamp(time_sample)], method='pad')[0]


X_train = df.iloc[:time_stamp_index,:].values
y_train = df.iloc[:time_stamp_index,:].values

X_test = df.iloc[time_stamp_index:,:].values
y_test = df.iloc[time_stamp_index:,:].values


#with CV
time_sample1 = '2001-11-12'
time_sample2 = '2001-11-16'

#get_indexer replaces get_loc(..., method='pad'), which pandas 2.0 removed
time_stamp_index1 = df.index.get_indexer([pd.Timestamp(time_sample1)], method='pad')[0]
time_stamp_index2 = df.index.get_indexer([pd.Timestamp(time_sample2)], method='pad')[0]

X_train = df.iloc[:time_stamp_index1,:].values
cv_train = df.iloc[time_stamp_index1:time_stamp_index2,:].values
y_train = df.iloc[:time_stamp_index2,:].values

X_test = df.iloc[time_stamp_index1:,:].values
cv_test = df.iloc[time_stamp_index1:time_stamp_index2,:].values
y_test = df.iloc[time_stamp_index2:,:].values

I also found np.split(df, np.where(...)) convenient for splitting a DataFrame on timestamps, based on this answer.
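
The same idea of cutting the frame at timestamp boundaries can be sketched without np.split, by looking up the cut positions with searchsorted on the index and slicing with iloc. This is a swapped-in variant, not the code from that answer. The frame is a stand-in for the sample df, and the cut dates reuse the two dates from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical 20-day frame sorted by date (a stand-in for the sample df).
df = pd.DataFrame(
    {"X1": np.arange(1.0, 21.0), "X2": np.arange(1.0, 21.0) * 2},
    index=pd.date_range("2001-11-01", periods=20, name="date"),
)

# Cut dates: each later piece starts at its cut date.
cuts = pd.to_datetime(["2001-11-12", "2001-11-16"])

# searchsorted gives the position of the first row at or after each cut date.
positions = df.index.searchsorted(cuts)

# Slice between consecutive positions to get train, validation and test.
bounds = [0, *positions, len(df)]
train, valid, test = (df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:]))

print(len(train), len(valid), len(test))  # 11 4 5
```

The three pieces are contiguous and non-overlapping, so the validation block sits between the training and test periods.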

Is there an elegant way to use sklearn's TimeSeriesSplit() with a DataFrame, with or without cross-validation (CV)?

What is the best practice for splitting time-series data for a regression task:

  • with CV
  • without CV?


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow