'Comparing Dates Across Multiple Variables

I'm attempting to figure out the amount of days in between games and if that has an impact on wins/losses, this is the information I'm starting with:

schedule:

Home Away Home_Final Away_Final Date
DAL OAK 30 35 9/1/2015
KC PHI 21 28 9/2/2015

This is the result I'd like to get:

Home Away Home_Final Away_Final Date Home_Rest Away_Rest Adv Adv_Days Adv_Won
DAL OAK 30 35 9/1/2015 null null null null null
KC PHI 21 28 9/2/2015 null null null null null
DAL PHI 28 7 9/9/2015 8 7 1 1 1
OAK KC 14 21 9/9/2015 8 7 1 1 0

'Home_Rest' = The home teams amount of days between their games

'Away Rest' = The away teams amount of days between their games

'Adv' = True/False that there was an advantage on one side

'Adv_Days' = The amount of advantage in days

'Adv_Won' = The side with the advantage won

Here is what I've tried, I was able to get it to count how many days were between games for one team, but when I bring all the other ones in I can't wrap my head around how to do that.

library(tidyverse)
library(lubridate)

team_post <- schedule %>% filter(home == 'PHI' | visitor == 'PHI')
day_dif = interval(lag(ymd(team_post$date)), ymd(team_post$date))

team_post <- team_post %>% mutate(days_off = time_length(day_dif, "days"))


Solution 1:[1]

You can extend this to all teams using a grouped mutate. See docs for group_by() here.

Something like

schedule %>%
   group_by(vars_to_group_by) %>%
   mutate(new_var = expr_to_calculate_new_var)

In future, it would be helpful if you included code to recreate a minimal dataset for your example.

Solution 2:[2]

The problem is that before you can calculate differences between dates, you must put your dataframe in a friendlier format. Because the Date applies to both teams, that is, one item applies to two columns in the dataframe, which makes it difficult to give it a uniform treatment.

We'll add an id (row number) to the schedule dataframe, as a primary key, so it becomes easy to identify the rows later on.

schedule <- tidyr::tribble(
  ~Home, ~Away, ~Home_Final, ~Away_Final, ~Date,
  "DAL", "OAK", 30, 35, "9/1/2015",
  "KC", "PHI", 21, 28, "9/2/2015",
  "DAL", "PHI", 28, 7, "9/9/2015",
  "OAK", "KC", 14, 21, "9/9/2015"
)

schedule <- schedule %>% mutate(id = row_number())

> schedule
# A tibble: 4 x 6
  Home  Away  Home_Final Away_Final Date        id
  <chr> <chr>      <dbl>      <dbl> <chr>    <int>
1 DAL   OAK           30         35 9/1/2015     1
2 KC    PHI           21         28 9/2/2015     2
3 DAL   PHI           28          7 9/9/2015     3
4 OAK   KC            14         21 9/9/2015     4

Now we'll place your dataframe in a more 'relational' format.

schedule_relational <-
  rbind(
    schedule %>%
      transmute(
        id,
        Team = Home,
        Role = "Home",
        Final = Home_Final,
        Date
      ),
    schedule %>%
      transmute(
        id,
        Team = Away,
        Role = "Away",
        Final = Away_Final,
        Date
      )
  )

> schedule_relational
# A tibble: 8 x 5
     id Team  Role  Final Date    
  <int> <chr> <chr> <dbl> <chr>   
1     1 DAL   Home     30 9/1/2015
2     2 KC    Home     21 9/2/2015
3     3 DAL   Home     28 9/9/2015
4     4 OAK   Home     14 9/9/2015
5     1 OAK   Away     35 9/1/2015
6     2 PHI   Away     28 9/2/2015
7     3 PHI   Away      7 9/9/2015
8     4 KC    Away     21 9/9/2015

How about that!

Now it becomes easy to calculate the difference between dates of games for each team:

schedule_relational <-
  schedule_relational %>%
  group_by(Team) %>%
  arrange(Date) %>%
  mutate(Rest = mdy(Date) - mdy(lag(Date))) %>%
  ungroup()

> schedule_relational
# A tibble: 8 x 6
     id Team  Role  Final Date     Rest   
  <int> <chr> <chr> <dbl> <chr>    <drtn> 
1     1 DAL   Home     30 9/1/2015 NA days
2     1 OAK   Away     35 9/1/2015 NA days
3     2 KC    Home     21 9/2/2015 NA days
4     2 PHI   Away     28 9/2/2015 NA days
5     3 DAL   Home     28 9/9/2015  8 days
6     4 OAK   Home     14 9/9/2015  8 days
7     3 PHI   Away      7 9/9/2015  7 days
8     4 KC    Away     21 9/9/2015  7 days

Observe that the appropriate function to convert dates in character format is mdy(), because your dates are in month/day/year format.

We're very close to a solution! Now all we have to do is to pivot your data back to the wider format. We'll join back the data on the home team and away team by using the id as our unique key.

result <-
  schedule_relational %>%
  pivot_wider(
    names_from = Role, 
    values_from = c(Team, Final, Rest),
    names_glue = "{Role}_{.value}"
  )

> result
# A tibble: 4 x 8
     id Date     Home_Team Away_Team Home_Final Away_Final Home_Rest Away_Rest
  <int> <chr>    <chr>     <chr>          <dbl>      <dbl> <drtn>    <drtn>   
1     1 9/1/2015 DAL       OAK               30         35 NA days   NA days  
2     2 9/2/2015 KC        PHI               21         28 NA days   NA days  
3     3 9/9/2015 DAL       PHI               28          7  8 days    7 days  
4     4 9/9/2015 OAK       KC                14         21  8 days    7 days

We'll adjust column names and ordering, and make the final calculations now.

result_final <-
  result %>%
  transmute(
    Home = Home_Team,
    Away = Away_Team,
    Home_Final,
    Away_Final,
    Date,
    Home_Rest,
    Away_Rest,
    Adv = as.integer(Home_Rest != Away_Rest),
    Adv_Days = abs(Home_Rest != Away_Rest),
    Adv_Won = as.integer(Home_Rest > Away_Rest & Home_Final > Away_Final | Away_Rest > Home_Rest & Away_Final > Home_Final)
  )

> result_final
# A tibble: 4 x 10
  Home  Away  Home_Final Away_Final Date     Home_Rest Away_Rest   Adv Adv_Days Adv_Won
  <chr> <chr>      <dbl>      <dbl> <chr>    <drtn>    <drtn>    <int>    <int>   <int>
1 DAL   OAK           30         35 9/1/2015 NA days   NA days      NA       NA      NA
2 KC    PHI           21         28 9/2/2015 NA days   NA days      NA       NA      NA
3 DAL   PHI           28          7 9/9/2015  8 days    7 days       1        1       1
4 OAK   KC            14         21 9/9/2015  8 days    7 days       1        1       0

It would be interesting if instead of reducing Adv and Adv_Won to yes/no (discrete) values, you'd track the number of days of rest and difference in score. Therefore you could correlate the results also in terms of magnitude.

I've made the code step by step, so you can see intermediate results and understand it better. You may later coalesce some of the statements if you want.

There may be more convoluted solutions, but this is very clear to read and understand.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2