'How to run multiple linear regression models for columns that start at different times, with a new dataframe as output, in R?
I am trying to run multiple linear regression models with columns in a dataframe that all start at different times.
df = structure(list(Date_Time_GMT_3 =
structure(c(1622552400, 1622553300,1622554200, 1622555100, 1622556000, 1622556900),
class = c("POSIXct","POSIXt"),
tzone = "EST"),
X20819830_R1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 16.808, 16.713, 17.753),
X20819742_R1AR_S_Stationary = c(16.903, 16.828, 16.808, NA_real_, NA_real_, NA_real_),
X20822215_R3AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
X20822215_R3AR_S_Stationary = c(13.942, 13.972, 13.842, NA_real_, NA_real_, NA_real_),
X20874235_R4AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 14.134, 14.534, 14.404),
X20874235_R4AR_S_Stationary = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_),
X20874311_F1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 15.187, 15.327, 15.567),
X20874311_F1AR_S_Stationary = c(15.282, 15.387, 15.587, NA_real_, NA_real_, NA_real_),
X20817727_F8AR_U = c(15.421, 14.441, 14.631, 14.781, 15.521, 15.821),
X20819742_X1AR_U = c(14.996, 15.996, 14.776, 14.920, 14.870, 14.235),
X20819742_R2AR_U = c(14.781, 15.521, 15.821, NA_real_, NA_real_, NA_real_),
X20817727_R5AR_U = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
X20817727_R7AR = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_)),
row.names = c(NA, 6L), class = "data.frame")
I am trying perform a linear regression model for all columns WITHOUT stationary in the column name to columns WITH stationary in the column name (i.e. stationary on the x-axis). However, as you can see in the sample dataframe, all the columns start and end at different times, so I also need the linear regression model to only run when the stationary and non-stationary columns have values at the same time.
Overall, I would like the output of the code to give me all the values for each "non-stationary" logger when run against each stationary logger. For example, a table like the one below...
X20819830_R1AR_U_Stationary
Logger_ID Reg_equation R_Squared Estimate_Std. Std_Error Pr_t..
<chr> <int> <int> <int> <int> <int>
1 X20676887_F8AR_U NA NA NA NA NA
2 X20819831_X1AR_U NA NA NA NA NA
3 X20822214_R2AR_U NA NA NA NA NA
1 X20676887_R7AR_U NA NA NA NA NA
2 X20819831_R5AR_U NA NA NA NA NA
X20822215_R3AR_U_Stationary
Logger_ID Reg_equation R_Squared Estimate_Std. Std_Error Pr_t..
<chr> <int> <int> <int> <int> <int>
1 X20676887_F8AR_U NA NA NA NA NA
2 X20819831_X1AR_U NA NA NA NA NA
3 X20822214_R2AR_U NA NA NA NA NA
1 X20676887_R7AR_U NA NA NA NA NA
2 X20819831_R5AR_U NA NA NA NA NA
I have code that can do it if I only have 1 stationary column, and if all the columns have values for the entire time.... which looks like so
library(tidyverse)
library(broom)
df1 %>%
pivot_longer(
cols = starts_with("X")
) %>%
mutate(name = factor(name)) %>%
group_by(name) %>%
group_split() %>%
map_dfr(.f = function(df){
lm(X20819830_R3AR_U_Stationary ~ value, data = df) %>%
glance() %>%
# tidy() %>%
add_column(name = unique(df$name), .before=1)
})
but as you can see it only takes into account 1 stationary column
, and when I run it with example dataframe I provided I get this error
Error in eval(predvars, data, env) :
object 'X20819830_R3AR_U_Stationary' not found
Any ideas?
Solution 1:[1]
Now that it is clear what you want to achieve, it turns out it's pretty easy to do by applying pivot_longer twice, once for stationary and then for non-stationary loggers.
Aside: Please look at how to correct for multiple hypothesis testing when interpreting the results. This is a lot of regressions.
library("broom")
library("tidyverse")
extract_statistics <- function(fit) {
bind_cols(
# Extract statistics about model coefficients
tidy(fit) %>% filter(term != "(Intercept)"),
# Extract statistics about model fit
glance(fit) %>% select(matches("r.squared"))
)
}
results <-
as_tibble(df) %>%
pivot_longer(
ends_with("STATIONARY"),
names_to = "response",
values_to = "y"
) %>%
# Every column other than the time index & the newly minted response & y columns
# corresponds to a non-stationary logger.
pivot_longer(
-c(Date_Time_GMT_3, response, y),
names_to = "predictor",
values_to = "x"
) %>%
# It's not strictly necessary; `lm` drops data points with missing values
drop_na() %>%
group_by(
response, predictor
) %>%
group_modify(
# ~ tidy(lm(y ~ x, data = .))
# ~ glance(lm(y ~ x, data = .))
~ extract_statistics(lm(y ~ x, data = .))
) %>%
ungroup()
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
results
#> # A tibble: 28 × 9
#> response predictor term estimate std.error statistic p.value r.squared
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 X20819742_R1A… X2081772… x 0.0893 0.0362 2.47 0.245 0.859
#> 2 X20819742_R1A… X2081772… x 0.599 0.677 0.885 0.539 0.439
#> 3 X20819742_R1A… X2081974… x -0.0932 0.00776 -12.0 0.0528 0.993
#> 4 X20819742_R1A… X2081974… x -0.0117 0.0761 -0.154 0.903 0.0231
#> 5 X20819830_R1A… X2081772… x 0.712 0.804 0.886 0.539 0.440
#> 6 X20819830_R1A… X2081772… x -10.3 0.857 -12.1 0.0527 0.993
#> 7 X20819830_R1A… X2081974… x -1.49 0.222 -6.70 0.0944 0.978
#> 8 X20822215_R3A… X2081772… x 0.0154 0.130 0.118 0.925 0.0138
#> 9 X20822215_R3A… X2081772… x 1.20 0.271 4.43 0.141 0.951
#> 10 X20822215_R3A… X2081974… x -0.0703 0.106 -0.663 0.627 0.305
#> # … with 18 more rows, and 1 more variable: adj.r.squared <dbl>
Let's see the pairs of loggers that cause the warnings about a perfect fit:
results %>%
slice_min(
std.error,
n = 2
)
#> # A tibble: 2 × 9
#> response predictor term estimate std.error statistic p.value r.squared
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 X20822215_R3A… X2081772… x 1 1.77e-16 5.65e15 1.13e-16 1
#> 2 X20874235_R4A… X2081772… x 1 1.77e-16 5.65e15 1.13e-16 1
#> # … with 1 more variable: adj.r.squared <dbl>
Yep, perfect fit it is.
df %>%
select(X20822215_R3AR_U_Stationary, X20817727_R5AR_U)
#> X20822215_R3AR_U_Stationary X20817727_R5AR_U
#> 1 NA NA
#> 2 NA NA
#> 3 NA NA
#> 4 13.942 13.942
#> 5 13.942 13.942
#> 6 13.846 13.846
df %>%
select(X20874235_R4AR_S_Stationary, X20817727_R7AR)
#> X20874235_R4AR_S_Stationary X20817727_R7AR
#> 1 14.230 14.230
#> 2 14.230 14.230
#> 3 14.134 14.134
#> 4 NA NA
#> 5 NA NA
#> 6 NA NA
Created on 2022-03-21 by the reprex package (v2.0.1)
I apologize for the code duplication in advance. This is a followup in response to a further question from the OP.
Here is how to switch the role of response and predictor.
df <- structure(list(
Date_Time_GMT_3 =
structure(c(1622552400, 1622553300, 1622554200, 1622555100, 1622556000, 1622556900),
class = c("POSIXct", "POSIXt"),
tzone = "EST"
),
X20819830_R1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 16.808, 16.713, 17.753),
X20819742_R1AR_S_Stationary = c(16.903, 16.828, 16.808, NA_real_, NA_real_, NA_real_),
X20822215_R3AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
X20822215_R3AR_S_Stationary = c(13.942, 13.972, 13.842, NA_real_, NA_real_, NA_real_),
X20874235_R4AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 14.134, 14.534, 14.404),
X20874235_R4AR_S_Stationary = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_),
X20874311_F1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 15.187, 15.327, 15.567),
X20874311_F1AR_S_Stationary = c(15.282, 15.387, 15.587, NA_real_, NA_real_, NA_real_),
X20817727_F8AR_U = c(15.421, 14.441, 14.631, 14.781, 15.521, 15.821),
X20819742_X1AR_U = c(14.996, 15.996, 14.776, 14.920, 14.870, 14.235),
X20819742_R2AR_U = c(14.781, 15.521, 15.821, NA_real_, NA_real_, NA_real_),
X20817727_R5AR_U = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
X20817727_R7AR = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_)
),
row.names = c(NA, 6L), class = "data.frame"
)
library("broom")
library("tidyverse")
extract_statistics <- function(fit) {
bind_cols(
# Extract statistics about model coefficients
tidy(fit) %>% filter(term != "(Intercept)"),
# Extract statistics about model fit
glance(fit) %>% select(matches("r.squared"))
)
}
results <-
as_tibble(df) %>%
# Pivot non-stationary loggers
pivot_longer(
-c(Date_Time_GMT_3, ends_with("STATIONARY")),
names_to = "response",
values_to = "y"
) %>%
# Every column other than the time index & the newly minted response & y columns
# corresponds to a non-stationary logger.
pivot_longer(
-c(Date_Time_GMT_3, response, y),
names_to = "predictor",
values_to = "x"
) %>%
# It's not strictly necessary; `lm` drops data points with missing values
drop_na() %>%
group_by(
response, predictor
) %>%
group_modify(
# ~ tidy(lm(y ~ x, data = .))
# ~ glance(lm(y ~ x, data = .))
~ extract_statistics(lm(y ~ x, data = .))
) %>%
ungroup()
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
results
#> # A tibble: 28 × 9
#> response predictor term estimate std.error statistic p.value r.squared
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 X20817727_F8… X2081974… x 9.62 3.90e+ 0 2.47e+ 0 2.45e- 1 0.859
#> 2 X20817727_F8… X2081983… x 0.617 6.97e- 1 8.86e- 1 5.39e- 1 0.440
#> 3 X20817727_F8… X2082221… x 0.896 7.58e+ 0 1.18e- 1 9.25e- 1 0.0138
#> 4 X20817727_F8… X2082221… x -6.98 6.68e+ 0 -1.05e+ 0 4.86e- 1 0.522
#> 5 X20817727_F8… X2087423… x 3.13 8.84e+ 0 3.53e- 1 7.84e- 1 0.111
#> 6 X20817727_F8… X2087423… x 2.15 1.50e+ 0 1.44e+ 0 3.87e- 1 0.673
#> 7 X20817727_F8… X2087431… x -2.12 2.60e+ 0 -8.18e- 1 5.64e- 1 0.401
#> 8 X20817727_F8… X2087431… x 2.58 1.06e+ 0 2.43e+ 0 2.49e- 1 0.855
#> 9 X20817727_R5… X2081983… x -0.0961 7.96e- 3 -1.21e+ 1 5.27e- 2 0.993
#> 10 X20817727_R5… X2082221… x 1 1.77e-16 5.65e+15 1.13e-16 1
#> # … with 18 more rows, and 1 more variable: adj.r.squared <dbl>
Created on 2022-03-31 by the reprex package (v2.0.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
