'How to run multiple linear regression models for columns that start at different times, with a new dataframe as output, in R?

I am trying to run multiple linear regression models with columns in a dataframe that all start at different times.

df = structure(list(Date_Time_GMT_3 = 
                      structure(c(1622552400, 1622553300,1622554200, 1622555100, 1622556000, 1622556900), 
                                class = c("POSIXct","POSIXt"), 
                                tzone = "EST"),
                    X20819830_R1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 16.808, 16.713, 17.753), 
                    X20819742_R1AR_S_Stationary = c(16.903, 16.828, 16.808, NA_real_, NA_real_, NA_real_), 
                    X20822215_R3AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846), 
                    X20822215_R3AR_S_Stationary = c(13.942, 13.972, 13.842, NA_real_, NA_real_, NA_real_), 
                    X20874235_R4AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 14.134, 14.534, 14.404), 
                    X20874235_R4AR_S_Stationary = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_), 
                    X20874311_F1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 15.187, 15.327, 15.567), 
                    X20874311_F1AR_S_Stationary = c(15.282, 15.387, 15.587, NA_real_, NA_real_, NA_real_), 
                    X20817727_F8AR_U = c(15.421, 14.441, 14.631, 14.781, 15.521, 15.821), 
                    X20819742_X1AR_U = c(14.996, 15.996, 14.776, 14.920, 14.870, 14.235), 
                    X20819742_R2AR_U = c(14.781, 15.521, 15.821, NA_real_, NA_real_, NA_real_), 
                    X20817727_R5AR_U = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846), 
                    X20817727_R7AR = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_)), 
               row.names = c(NA, 6L), class = "data.frame")

I am trying perform a linear regression model for all columns WITHOUT stationary in the column name to columns WITH stationary in the column name (i.e. stationary on the x-axis). However, as you can see in the sample dataframe, all the columns start and end at different times, so I also need the linear regression model to only run when the stationary and non-stationary columns have values at the same time.

Overall, I would like the output of the code to give me all the values for each "non-stationary" logger when run against each stationary logger. For example, a table like the one below...

X20819830_R1AR_U_Stationary
Logger_ID        Reg_equation R_Squared Estimate_Std. Std_Error  Pr_t..
  <chr>            <int>               <int>     <int>        <int>     <int>   
1 X20676887_F8AR_U NA                  NA        NA            NA         NA      
2 X20819831_X1AR_U NA                  NA        NA            NA         NA      
3 X20822214_R2AR_U NA                  NA        NA            NA         NA  
1 X20676887_R7AR_U NA                  NA        NA            NA         NA      
2 X20819831_R5AR_U NA                  NA        NA            NA         NA      
    

X20822215_R3AR_U_Stationary
Logger_ID        Reg_equation R_Squared Estimate_Std. Std_Error  Pr_t..
  <chr>            <int>               <int>     <int>        <int>     <int>   
1 X20676887_F8AR_U NA                  NA        NA            NA         NA      
2 X20819831_X1AR_U NA                  NA        NA            NA         NA      
3 X20822214_R2AR_U NA                  NA        NA            NA         NA  
1 X20676887_R7AR_U NA                  NA        NA            NA         NA      
2 X20819831_R5AR_U NA                  NA        NA            NA         NA

I have code that can do it if I only have 1 stationary column, and if all the columns have values for the entire time.... which looks like so

library(tidyverse)
library(broom)
df1 %>% 
  pivot_longer(
    cols = starts_with("X")
  ) %>% 
  mutate(name = factor(name)) %>% 
  group_by(name) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(X20819830_R3AR_U_Stationary ~ value, data = df) %>% 
      glance() %>% 
      # tidy() %>%  
      add_column(name = unique(df$name), .before=1)
  })

but as you can see it only takes into account 1 stationary column , and when I run it with example dataframe I provided I get this error

Error in eval(predvars, data, env) : 
object 'X20819830_R3AR_U_Stationary' not found

Any ideas?

r dataframe linear-regression

Solution 1:^[1]

Now that it is clear what you want to achieve, it turns out it's pretty easy to do by applying pivot_longer twice, once for stationary and then for non-stationary loggers.

Aside: Please look at how to correct for multiple hypothesis testing when interpreting the results. This is a lot of regressions.

library("broom")
library("tidyverse")

extract_statistics <- function(fit) {
  bind_cols(
    # Extract statistics about model coefficients
    tidy(fit) %>% filter(term != "(Intercept)"),
    # Extract statistics about model fit
    glance(fit) %>% select(matches("r.squared"))
  )
}

results <-
  as_tibble(df) %>%
  pivot_longer(
    ends_with("STATIONARY"),
    names_to = "response",
    values_to = "y"
  ) %>%
  # Every column other than the time index & the newly minted response & y columns
  # corresponds to a non-stationary logger.
  pivot_longer(
    -c(Date_Time_GMT_3, response, y),
    names_to = "predictor",
    values_to = "x"
  ) %>%
  # It's not strictly necessary; `lm` drops data points with missing values
  drop_na() %>%
  group_by(
    response, predictor
  ) %>%
  group_modify(
    # ~ tidy(lm(y ~ x, data = .))
    # ~ glance(lm(y ~ x, data = .))
    ~ extract_statistics(lm(y ~ x, data = .))
  ) %>%
  ungroup()
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
results
#> # A tibble: 28 × 9
#>    response       predictor term  estimate std.error statistic p.value r.squared
#>    <chr>          <chr>     <chr>    <dbl>     <dbl>     <dbl>   <dbl>     <dbl>
#>  1 X20819742_R1A… X2081772… x       0.0893   0.0362      2.47   0.245     0.859 
#>  2 X20819742_R1A… X2081772… x       0.599    0.677       0.885  0.539     0.439 
#>  3 X20819742_R1A… X2081974… x      -0.0932   0.00776   -12.0    0.0528    0.993 
#>  4 X20819742_R1A… X2081974… x      -0.0117   0.0761     -0.154  0.903     0.0231
#>  5 X20819830_R1A… X2081772… x       0.712    0.804       0.886  0.539     0.440 
#>  6 X20819830_R1A… X2081772… x     -10.3      0.857     -12.1    0.0527    0.993 
#>  7 X20819830_R1A… X2081974… x      -1.49     0.222      -6.70   0.0944    0.978 
#>  8 X20822215_R3A… X2081772… x       0.0154   0.130       0.118  0.925     0.0138
#>  9 X20822215_R3A… X2081772… x       1.20     0.271       4.43   0.141     0.951 
#> 10 X20822215_R3A… X2081974… x      -0.0703   0.106      -0.663  0.627     0.305 
#> # … with 18 more rows, and 1 more variable: adj.r.squared <dbl>

Let's see the pairs of loggers that cause the warnings about a perfect fit:

results %>%
  slice_min(
    std.error,
    n = 2
  )
#> # A tibble: 2 × 9
#>   response       predictor term  estimate std.error statistic  p.value r.squared
#>   <chr>          <chr>     <chr>    <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
#> 1 X20822215_R3A… X2081772… x            1  1.77e-16   5.65e15 1.13e-16         1
#> 2 X20874235_R4A… X2081772… x            1  1.77e-16   5.65e15 1.13e-16         1
#> # … with 1 more variable: adj.r.squared <dbl>

Yep, perfect fit it is.

df %>%
  select(X20822215_R3AR_U_Stationary, X20817727_R5AR_U)
#>   X20822215_R3AR_U_Stationary X20817727_R5AR_U
#> 1                          NA               NA
#> 2                          NA               NA
#> 3                          NA               NA
#> 4                      13.942           13.942
#> 5                      13.942           13.942
#> 6                      13.846           13.846

df %>%
  select(X20874235_R4AR_S_Stationary, X20817727_R7AR)
#>   X20874235_R4AR_S_Stationary X20817727_R7AR
#> 1                      14.230         14.230
#> 2                      14.230         14.230
#> 3                      14.134         14.134
#> 4                          NA             NA
#> 5                          NA             NA
#> 6                          NA             NA

^{Created on 2022-03-21 by the reprex package (v2.0.1)}

I apologize for the code duplication in advance. This is a followup in response to a further question from the OP.

Here is how to switch the role of response and predictor.

df <- structure(list(
  Date_Time_GMT_3 =
    structure(c(1622552400, 1622553300, 1622554200, 1622555100, 1622556000, 1622556900),
      class = c("POSIXct", "POSIXt"),
      tzone = "EST"
    ),
  X20819830_R1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 16.808, 16.713, 17.753),
  X20819742_R1AR_S_Stationary = c(16.903, 16.828, 16.808, NA_real_, NA_real_, NA_real_),
  X20822215_R3AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
  X20822215_R3AR_S_Stationary = c(13.942, 13.972, 13.842, NA_real_, NA_real_, NA_real_),
  X20874235_R4AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 14.134, 14.534, 14.404),
  X20874235_R4AR_S_Stationary = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_),
  X20874311_F1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 15.187, 15.327, 15.567),
  X20874311_F1AR_S_Stationary = c(15.282, 15.387, 15.587, NA_real_, NA_real_, NA_real_),
  X20817727_F8AR_U = c(15.421, 14.441, 14.631, 14.781, 15.521, 15.821),
  X20819742_X1AR_U = c(14.996, 15.996, 14.776, 14.920, 14.870, 14.235),
  X20819742_R2AR_U = c(14.781, 15.521, 15.821, NA_real_, NA_real_, NA_real_),
  X20817727_R5AR_U = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
  X20817727_R7AR = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_)
),
row.names = c(NA, 6L), class = "data.frame"
)

library("broom")
library("tidyverse")

extract_statistics <- function(fit) {
  bind_cols(
    # Extract statistics about model coefficients
    tidy(fit) %>% filter(term != "(Intercept)"),
    # Extract statistics about model fit
    glance(fit) %>% select(matches("r.squared"))
  )
}

results <-
  as_tibble(df) %>%
  # Pivot non-stationary loggers
  pivot_longer(
    -c(Date_Time_GMT_3, ends_with("STATIONARY")),
    names_to = "response",
    values_to = "y"
  ) %>%
  # Every column other than the time index & the newly minted response & y columns
  # corresponds to a non-stationary logger.
  pivot_longer(
    -c(Date_Time_GMT_3, response, y),
    names_to = "predictor",
    values_to = "x"
  ) %>%
  # It's not strictly necessary; `lm` drops data points with missing values
  drop_na() %>%
  group_by(
    response, predictor
  ) %>%
  group_modify(
    # ~ tidy(lm(y ~ x, data = .))
    # ~ glance(lm(y ~ x, data = .))
    ~ extract_statistics(lm(y ~ x, data = .))
  ) %>%
  ungroup()
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
results
#> # A tibble: 28 × 9
#>    response      predictor term  estimate std.error statistic  p.value r.squared
#>    <chr>         <chr>     <chr>    <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
#>  1 X20817727_F8… X2081974… x       9.62    3.90e+ 0  2.47e+ 0 2.45e- 1    0.859 
#>  2 X20817727_F8… X2081983… x       0.617   6.97e- 1  8.86e- 1 5.39e- 1    0.440 
#>  3 X20817727_F8… X2082221… x       0.896   7.58e+ 0  1.18e- 1 9.25e- 1    0.0138
#>  4 X20817727_F8… X2082221… x      -6.98    6.68e+ 0 -1.05e+ 0 4.86e- 1    0.522 
#>  5 X20817727_F8… X2087423… x       3.13    8.84e+ 0  3.53e- 1 7.84e- 1    0.111 
#>  6 X20817727_F8… X2087423… x       2.15    1.50e+ 0  1.44e+ 0 3.87e- 1    0.673 
#>  7 X20817727_F8… X2087431… x      -2.12    2.60e+ 0 -8.18e- 1 5.64e- 1    0.401 
#>  8 X20817727_F8… X2087431… x       2.58    1.06e+ 0  2.43e+ 0 2.49e- 1    0.855 
#>  9 X20817727_R5… X2081983… x      -0.0961  7.96e- 3 -1.21e+ 1 5.27e- 2    0.993 
#> 10 X20817727_R5… X2082221… x       1       1.77e-16  5.65e+15 1.13e-16    1     
#> # … with 18 more rows, and 1 more variable: adj.r.squared <dbl>

^{Created on 2022-03-31 by the reprex package (v2.0.1)}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1

'How to run multiple linear regression models for columns that start at different times, with a new dataframe as output, in R?

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]