R Expanding Window RandomForest, Accuracy Not Dropping Off With Increases in Lag

I am trying to make a binary prediction (predicting QQQ states) using 16 input variables. My data set covers 2001-2022. Here is what it looks like (predicting X0, which is 5 days ahead):

First, I use cross-validation with an 80-20 train/test split on the 2001-2017 data in order to test the accuracy of a potential model.

However, since I want the model making forward predictions, I train it on the 2001-2017 data set and make chronological predictions for the 2018-2022 data set. Understandably, the accuracy drops off.

In order to improve the accuracy, I run an expanding-window prediction model: I keep retraining the model on all prior available observations in order to predict the next state in the data set, incrementing the training set by one date for each model. The output is a set of 2018-2022 state predictions where the state for each date was predicted using a different training set. This should also help the model adapt to new market conditions/phases. The accuracy improves.
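Stripped to its core, the expanding-window scheme looks like this (a minimal sketch with a synthetic binary series and a trivial majority-vote rule standing in for the RandomForest; the real version is the commented-out loop in the code below):

```r
# Minimal expanding-window sketch: at each step, "fit" on all rows seen so far
# and predict the next one. A majority-vote rule stands in for the RandomForest.
set.seed(1)
y <- rbinom(100, 1, 0.6)                   # synthetic binary state series
train_end <- 80                            # initial training window: rows 1..80
preds <- integer(0)
for (a in train_end:(length(y) - 1)) {
  fit <- as.integer(mean(y[1:a]) >= 0.5)   # "model" trained on rows 1..a
  preds <- c(preds, fit)                   # one-step-ahead prediction for row a+1
}
acc <- mean(preds == y[(train_end + 1):length(y)])  # out-of-sample accuracy
```

Each prediction is made strictly out of sample, and each one uses a training set one observation larger than the previous.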

However, when I change the lags, I notice that the accuracy does not drop off as the lag increases…

The code has been checked extensively and the lags for each data set appear legitimate. This leads to the question: what is wrong with my model? Might there be a model better suited for my purposes? It also makes me wonder why there is such variability in the Sharpe ratio across models; is the 15th lag having the highest Sharpe purely coincidental? One theory was that the training set is quite comprehensive, so the model is great at making near-term predictions regardless of lag. However, even when I took the lags to an extreme, the accuracy still did not drop off.
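One way to double-check the lag alignment independently of the modelling code is to confirm that, within the shared trading calendar, each paired state date sits exactly `lag` positions after its macro date (a self-contained sketch using a stand-in date vector; in the real code the two columns would be `Date` and `formatted_date` from the `full_w_dates` frame):

```r
# Sanity check: after cbind-ing macro rows 1..(n-lag) with state rows (lag+1)..n,
# every paired state date should sit exactly `lag` positions after the macro
# date in the original calendar.
dates <- as.Date("2020-01-01") + 0:99            # stand-in trading calendar
lag   <- 5
macro_dates <- dates[1:(length(dates) - lag)]    # predictor side of the cbind
state_dates <- dates[(lag + 1):length(dates)]    # target side of the cbind
offset <- match(state_dates, dates) - match(macro_dates, dates)
stopifnot(all(offset == lag))                    # every row lagged by exactly `lag`
```

If `offset` is constant and equal to `lag` for every row, the pairing itself is not the problem, which points the investigation toward the target series (e.g. its persistence or class balance) rather than the data wrangling.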

Should I try using a different model? Any advice or guidance would be greatly appreciated. Please see my code below (the commented-out loop is the expanding-window RandomForest application).

library(ggplot2)
library(BatchGetSymbols)
library(data.table)
library(plyr)
library(quantmod)
library(PerformanceAnalytics)
defaultW <- getOption("warn")
options(warn = -1)
library(dplyr)
library(caret)
library(ranger)
### Data Import ######
states_full <- read.csv(file = "rolling_qqq_hidden_states_full_five_back.csv")
states_full$formatted_date <- as.Date(states_full$formatted_date)
states_full <- states_full[!duplicated(states_full$formatted_date),]

tickers <- c("QQQ", "^VXN")
l.out <- BatchGetSymbols(tickers = tickers,
                         first.date = states_full$formatted_date[1],
                         last.date = states_full$formatted_date[nrow(states_full)]+1, do.cache=FALSE, be.quiet = TRUE)

price_data <- data.frame(l.out$df.tickers$price.adjusted,l.out$df.tickers$ret.adjusted.prices, l.out$df.tickers$ref.date, l.out$df.tickers$ticker)
colnames(price_data) <- c("Value", "Daily Return", "Date", "Label")

QQQ_full <- price_data[which(price_data$Label == "QQQ"),]



# Make sure dates match: drop QQQ rows with no matching state date
mylist <- c()
for (i in 1:nrow(QQQ_full)){
  if (sum(QQQ_full$Date[i] == states_full$formatted_date) != 1){
    mylist <- c(mylist, i)
  }
}

if(length(mylist) > 0){
  QQQ_full <- QQQ_full[-mylist,]
}

# drop state rows with no matching QQQ date
mylist <- c()
for (i in 1:nrow(states_full)){
  if (sum(states_full$formatted_date[i] == QQQ_full$Date) != 1){
    mylist <- c(mylist, i)
  }
}

if(length(mylist) > 0){
  states_full <- states_full[-mylist,]
}

# split the data into 2001-2017, 2018-2022
states_01_17 <- states_full[1:which(states_full$formatted_date == "2017-12-29"),]
states_17_22 <- states_full[(nrow(states_01_17)+1):nrow(states_full),]
QQQ_01_17<- QQQ_full[1:which(QQQ_full$Date == "2017-12-29"),]
QQQ_17_22 <- QQQ_full[(which(QQQ_full$Date == "2017-12-29")+1):nrow(QQQ_full),]

# build QQQ portfolio
QQQ_portfolio <- as.data.frame(matrix(nrow = nrow(QQQ_17_22) , ncol = 3))
colnames(QQQ_portfolio) <- c("Value", "Date", "Label")
QQQ_portfolio$Value <- 100
QQQ_portfolio$Label <- "QQQ Portfolio"
QQQ_portfolio$Date <- QQQ_17_22$Date

for(m in 2:nrow(QQQ_portfolio)){
  QQQ_portfolio$Value[m] <- QQQ_portfolio$Value[m-1] * (1+QQQ_17_22$`Daily Return`[m])
}

# build non-lagged states portfolio
states_portfolio <- as.data.frame(matrix(nrow = nrow(QQQ_17_22) , ncol = 3))
colnames(states_portfolio) <- c("Value", "Date", "Label")
states_portfolio$Value <- 100
states_portfolio$Label <- "0 Lag RandomForest Prediction of MSDR"
states_portfolio$Date <- QQQ_17_22$Date

for(i in 2:nrow(states_portfolio)){
  if (states_17_22$X0[i-1] == 1){
    states_portfolio$Value[i] <- states_portfolio$Value[i-1] * (1+QQQ_17_22$`Daily Return`[i])
  } else {
    states_portfolio$Value[i] <- states_portfolio$Value[i-1]
  }
}

# Calculate non-lagged sharpe as benchmark
#states_portfolio_returns <- data.frame(Delt(states_portfolio$Value)[-1])
#states_portfolio_returns_xts <- xts(states_portfolio_returns,states_portfolio$Date[-1])
#as.numeric(SharpeRatio.annualized(states_portfolio_returns_xts))

# bind portfolios together for plotting
port_comp <- rbind(QQQ_portfolio,states_portfolio)

# data set that will hold performance metrics
loop_output <- as.data.frame(matrix(0, nrow = 22, ncol = 8))
colnames(loop_output) <- c("Lag", "Cross Validation Accuracy 01-17","Forward Accuracy 18-22","Sharpe", "Average 1YR Rolling Sharpe",
                           "Median 1YR Rolling Sharpe","Min 1YR Rolling Sharpe","Max 1YR Rolling Sharpe")

# read macro data
macro_full <- data.frame(read.csv("macroindicators3.csv"))

for (j in 2:ncol(macro_full)){
  macro_full[j] <- as.numeric(nafill(macro_full[,j], type = "locf"))
}
macro_full$Date <- as.Date(macro_full[,1], "%m/%d/%Y")
macro_full <- macro_full[,-1]
macro_full <- macro_full[-1,]

# Remove NA columns, can try with more columns values later...
macro_no_na_full <- macro_full[,colSums(is.na(macro_full))==0]

# make sure dates match
mylist <- c()
for (k in 1:nrow(states_full)){
  if (sum(states_full$formatted_date[k] == macro_full$Date) != 1){
    mylist <- c(mylist, k)
  }
}

if(length(mylist) > 0){
  states_full <- states_full[-mylist,]
}


mylist <- c()
for (l in 1:nrow(macro_full)){
  if (sum(macro_full$Date[l] == states_full$formatted_date) != 1){
    mylist <- c(mylist, l)
  }
}
if(length(mylist) > 0){
  macro_full <- macro_full[-mylist,]
}

# states are a factor
states_full$X0 <- as.factor(states_full$X0)



set.seed(42)
for (i in 1:22){  # 22 lag settings, matching nrow(loop_output)
  if (i <= 8){
    lag = i*5 # increment lag by 5 until 40
  } else if (i <= 14){
    lag = 40 + (i-8)*10 # increment lag by 10 until 100
  } else {
    lag = 100+(i-14)*100 # increment lag by 100 until 900
  }
  
  print(lag)
  
  #Save lag
  loop_output$Lag[i] <- lag
  
  #Create a lagged data frame
  full <- cbind(macro_no_na_full[1:(nrow(macro_no_na_full)-lag),], states_full[(lag+1):nrow(states_full),])
  full_01_17 <- full[1:(which(full$Date == "2017-12-29")-lag),]
  full_17_22 <- full[-(1:nrow(full_01_17)),]
  
  # save version with dates to verify lags later
  full_w_dates <- full
  full_01_17_w_dates <- full_01_17
  full_17_22_w_dates <- full_17_22
  
  # remove dates for ml
  full <- full[,-c(17,18)]
  full_01_17 <- full_01_17[,-c(17,18)]
  full_17_22 <- full_17_22[,-c(17,18)]
  
  # this is just for cross validation model
  x_01_17 <- data.frame(full_01_17[,-ncol(full_01_17)])
  y_01_17 <- full_01_17$X0
  
  
  # run cross validation model
  train=sample(nrow(full_01_17),nrow(full_01_17)*.8,replace=FALSE) #Train/Test
  
  rf.reg = ranger(y = y_01_17[train], x= x_01_17[train,] ,mtry=round(sqrt(ncol(x_01_17))),num.trees=200,importance = "impurity") 
  
  y.rf.pred = predict(rf.reg, x_01_17[-train,])$predictions # Predict with bagging
  
  # cross validation model accuracy
  rf.acc = mean(y.rf.pred==y_01_17[-train]) # Directly compute the accuracy
  #rf.acc
  #table(y.rf.pred,y_01_17[-train])
  loop_output$`Cross Validation Accuracy 01-17`[i] <- rf.acc
  
  
  # Expanding window models - takes a while
  # prediction <- as.data.frame(matrix(0,nrow = nrow(full_17_22), ncol= 2)) # data set to store predictions
  # prediction$V1 <- as.factor(c(0,1))[1] # store predictions as a factor
  # previous = 0 # progress bar
  # for(a in nrow(full_01_17):(nrow(full)-1)){ #expanding window starts with 2001-2017, next iteration is 2001-2017+1day
  #  progress = (a-nrow(full_01_17))/(nrow(full)-1-nrow(full_01_17)) # progress bar
  #  progress = round_any(progress, 0.01)                            # progress bar
  #  if (progress != previous){                                      # progress bar
  #    print(progress)                                               # progress bar
  #  }
  #  previous = progress                                             # progress bar
  
  #  rf.reg = ranger(X0~., data=full[1:a,], mtry=round(sqrt(ncol(x_01_17))), num.trees=800, importance='impurity') # ranger model
  #  y.rf.pred = predict(rf.reg, full[a+1,])$predictions # make the prediction on the a+1 observation
  #  prediction$V1[a-nrow(full_01_17)+1] <- y.rf.pred #save the prediction 
  #  prediction$V2<-as.Date(prediction$V2) # save the date so we can verify lags
  #  prediction$V2[a-nrow(full_01_17)+1] <- as.Date(full_w_dates$formatted_date[a+1])
  #  if (a == nrow(full)-1) message("Done!") # gives a status update
  # }
  # 
  # write.csv(prediction, paste(lag,"lagprediction.csv", sep = "")) # save the prediction so we don't have to rerun
  ####

  ### to read in results from already completed backtests
  prediction <- read.csv(paste(lag,"lagprediction.csv", sep = ""))[2]
  ###
  full_17_22_w_pred <- full_17_22_w_dates
  full_17_22_w_pred$prediction <- prediction$V1

  # Evaluate the accuracy
  rf.acc = mean(full_17_22_w_pred$prediction==full_17_22_w_pred$X0)
  loop_output$`Forward Accuracy 18-22`[i] <- rf.acc
  
  # build a portfolio out of the predicted states
  portfolio <- as.data.frame(matrix(0,nrow = nrow(full_17_22), ncol= 3))
  colnames(portfolio) <- c("Value", "Date", "Label")
  portfolio$Date <- full_17_22_w_pred$formatted_date
  portfolio$Value <- 100
  portfolio$Label <- paste(lag,"Lag RandomForest Prediction of MSDR", sep = " ")


  for(b in 2:nrow(portfolio)){
    if (full_17_22_w_pred$prediction[b-1] == 1){
      portfolio$Value[b] <- portfolio$Value[b-1] * (1+QQQ_17_22$`Daily Return`[b])
    } else {
      portfolio$Value[b] <- portfolio$Value[b-1]
    }
  }

  # save it to dataset containing port
  port_comp <- rbind(port_comp, portfolio)

  # calculate Sharpe
  portfolio_returns <- data.frame(Delt(portfolio$Value)[-1])
  portfolio_returns_xts <- xts(portfolio_returns, portfolio$Date[-1])
  loop_output$Sharpe[i] <- as.numeric(SharpeRatio.annualized(portfolio_returns_xts))
  
  # rolling sharpe
  mylist <- c()
  for (z in 1:(nrow(portfolio_returns)-251)){
    portfolio_xts_rolling <- portfolio_returns_xts[z:(z+251)]  # 252 trading days
    mylist <- c(mylist, as.numeric(SharpeRatio.annualized(portfolio_xts_rolling)))
  }
  loop_output$`Average 1YR Rolling Sharpe`[i]<- mean(mylist)
  loop_output$`Median 1YR Rolling Sharpe`[i]<- median(mylist)
  loop_output$`Min 1YR Rolling Sharpe`[i]<- min(mylist)
  loop_output$`Max 1YR Rolling Sharpe`[i]<- max(mylist)
}

options(warn = defaultW)

# plot output
ggplot(port_comp, aes(x = Date, y = Value, color = Label, group = Label)) + geom_line()
 #loop_output_v1 <- rbind(loop_output_v1, loop_output)
loop_output


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow