Create a plot of Actual vs Predicted response values, as a function of time, in R

I am trying to plot the actual and predicted values of a continuous response (on the y axis) against time (on the x axis), where the predictions come from a random forest model.

Ideally, the figure would look something like the one below (note: this figure was not generated from the dummy data provided, and I expect the label text and points to render at a lower resolution when produced with R code):

[Image: example of the desired plot]

Below is a dummy data set.

# Necessary packages

library('tidyverse')
library('rsample')
library('ggplot2')
library('randomForest')

# Generate the dummy data

dummy_dat <- data.frame(var1 = c(29.897, 29.897, 29.897, 29.897, 29.897, 29.897, 29.897, 29.897, 29.897, 29.897, 29.897, 29.897), 
var2 = c(3.7805, 3.7805, 3.7805, 3.7805, 3.7805, 3.7805, 3.7805, 3.7805, 3.7805, 3.7805, 3.7805, 3.7805), 
var3 = c(5, 12, 24, 27, 33, 37, 57, 62, 91, 96, 97, 101), 
var4 = c(0.336, 0.345, 0.433, 0.744, 1.22, 23.8, 6.75, 7.51, 52.73, 38.82, 101.2, 102.45), 
var5 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
time = c(0.0, 0.129, 0.2097, 0.8226, 1.484, 2.290, 3.0, 3.694, 4.581, 4.887, 5.790, 6.5), 
response = c(0, 4.755, 6.152, 7.247, 8.621, 11.388, 14.16, 16.652, 23.897, 28.085, 34.208, 36.7))

# Split the dummy data into a train and test set for random forest

set.seed(123)
dummy_dat_split = dummy_dat %>%
  rsample::initial_split(prop = 0.8)
dummy_dat_train = training(dummy_dat_split) # pull train set
dummy_dat_test = testing(dummy_dat_split) # pull test set

I trained and tested a random forest model on these dummy data with the following code to get the actual and predicted response values; all remaining variables, including time, were used as predictors.

# Train a random forest model
train_rf <- randomForest(response ~ ., data = dummy_dat_train, mtry = 1, maxnodes = 50, ntree = 50) # creates an object of randomForest.formula class

# Test the random forest model
pred_response <- predict(train_rf, newdata = dummy_dat_test[, -7]) # column 7 is response; returns a numeric vector of predicted response values

I am unsure how to generate a plot similar to the one attached to this post. The predicted response values are stored as a numeric vector in the pred_response object generated above.

I can draw a geom_point plot (with ggplot2) of the predicted and observed values, but not as a function of time.

What are the necessary steps and/or functions that I will need to use to render something equivalent to the plot above, where predicted and observed values are a response to time? Thank you in advance.



Solution 1:[1]

You are predicting from dummy_dat_test, so you will get one prediction per row of that data frame. In your example, this data frame has only 3 rows, so it's not a very interesting plot.
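If you want to check the row counts yourself, here is a minimal sketch. The stand_in data frame and its columns are placeholders I made up (only the number of rows matters here); the split call is the same one used in the question.

```r
library(rsample)

# Hypothetical 12-row stand-in for dummy_dat; only the row count matters here
stand_in <- data.frame(row_id = 1:12, response = seq(0, 33, by = 3))

set.seed(123)
stand_in_split <- initial_split(stand_in, prop = 0.8)

# Number of rows that end up in each part of the split
n_train <- nrow(training(stand_in_split))
n_test  <- nrow(testing(stand_in_split))
n_train
n_test
```

Whatever ends up in the test set is all that predict() sees, so the plot can never have more points per series than n_test.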

The easiest way to plot the predictions is to add them as a column to your data frame:

dummy_dat_test$predict <- pred_response

ggplot(dummy_dat_test, aes(time, response)) +
  geom_point(aes(color = "Observed"), size = 4) +
  geom_line(aes(color = "Observed")) +
  geom_point(aes(y = predict, color = "Predicted"), size = 4) +
  geom_line(aes(y = predict, color = "Predicted")) +
  scale_color_manual(values = c("orange", "deepskyblue3")) +
  theme_bw()

[Image: observed and predicted response against time for the test set]
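As an alternative, you could reshape the data to long format and let a single colour mapping separate the two series. This is only a sketch: plot_dat is a made-up stand-in for dummy_dat_test after the predict column has been added, and its numbers are for illustration only.

```r
library(ggplot2)
library(tidyr)

# Hypothetical stand-in for dummy_dat_test with the predict column added;
# the values are invented for illustration
plot_dat <- data.frame(
  time     = c(0.2, 1.5, 4.6),
  response = c(6.2, 8.6, 23.9),
  predict  = c(9.1, 12.4, 20.3)
)

# Stack the observed and predicted columns into one value column,
# with a "series" column recording where each value came from
plot_long <- pivot_longer(
  plot_dat,
  cols      = c(response, predict),
  names_to  = "series",
  values_to = "value"
)

# Give the series readable legend labels
plot_long$series <- ifelse(plot_long$series == "response", "Observed", "Predicted")

p <- ggplot(plot_long, aes(time, value, color = series)) +
  geom_point(size = 4) +
  geom_line() +
  scale_color_manual(values = c(Observed = "orange", Predicted = "deepskyblue3")) +
  theme_bw()
p
```

With long data you only need one geom_point() and one geom_line() layer, which scales better if you later add predictions from more than one model.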

You could show the predictions from your entire data set in much the same way. Here the plot is a bit more interesting:

[Image: observed and predicted response against time for the entire data set]
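One way to do this is sketched below. It assumes you keep the model trained on the training split and simply predict for all 12 rows, so most of the points are rows the model has already seen; all_dat is a name I introduced to keep the original data frame untouched.

```r
library(rsample)
library(randomForest)
library(ggplot2)

# Dummy data from the question
dummy_dat <- data.frame(
  var1 = rep(29.897, 12),
  var2 = rep(3.7805, 12),
  var3 = c(5, 12, 24, 27, 33, 37, 57, 62, 91, 96, 97, 101),
  var4 = c(0.336, 0.345, 0.433, 0.744, 1.22, 23.8, 6.75, 7.51, 52.73, 38.82, 101.2, 102.45),
  var5 = rep(0, 12),
  time = c(0.0, 0.129, 0.2097, 0.8226, 1.484, 2.290, 3.0, 3.694, 4.581, 4.887, 5.790, 6.5),
  response = c(0, 4.755, 6.152, 7.247, 8.621, 11.388, 14.16, 16.652, 23.897, 28.085, 34.208, 36.7)
)

# Same split and model settings as in the question
set.seed(123)
dummy_dat_split <- initial_split(dummy_dat, prop = 0.8)
dummy_dat_train <- training(dummy_dat_split)
train_rf <- randomForest(response ~ ., data = dummy_dat_train,
                         mtry = 1, maxnodes = 50, ntree = 50)

# Assumption: predict for every row, including the rows used for training
all_dat <- dummy_dat
all_dat$predict <- predict(train_rf, newdata = dummy_dat)

p_all <- ggplot(all_dat, aes(time, response)) +
  geom_point(aes(color = "Observed"), size = 4) +
  geom_line(aes(color = "Observed")) +
  geom_point(aes(y = predict, color = "Predicted"), size = 4) +
  geom_line(aes(y = predict, color = "Predicted")) +
  scale_color_manual(values = c("orange", "deepskyblue3")) +
  theme_bw()
p_all
```

Keep in mind that predictions for the training rows will tend to look better than predictions for unseen rows, so this plot is not a fair picture of out-of-sample performance.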

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1