'R Dataframe By Group Calculation

I have a dataframe like below (the real data has many more people and club):

Year   Player    Club
2005   Phelan    Chicago Fire 
2007   Phelan    Boston Pant
2008   Phelan    Boston Pant
2010   Phelan    Chicago Fire  

2002   John      New York Jet
2006   John      New York Jet
2007   John      Atlanta Elephant
2009   John      Chicago Fire

I want to calculate a club level measure (previous_exp) for each club. Clubs are linked together through the mobility of players. "previous_exp" captures club-level inflow sources for each club.

For example, for Chicago Fire, Phelan came to this club in 2005 and then left. He returned in 2010. Before Phelan's first stay in Chicago Fire, he had zero previous experience (so we ignore it). However, before his second stay in Chicago Fire, he had 5 years of previous experience (2005-2009) in two different clubs (Chicago Fire and Boston Pant). For John, before he came to Chicago Fire, he had 7 years of previous experience (New York Jet and Atlanta Elephant).

For Chicago Fire, based on the career records of Phelan and John, it in total accumulated 0+5+7=12 years of previous experience from other clubs. Among these 12 years, 2 years are from Chicago Fire itself (Phelan 2005-2006 assuming 2006 Phelan stayed in Chicago Fire), 3 years are from Boston Pant (Phelan 2007-2009 assuming 2009 Phelan stayed in Boston Pant), 5 years from New York Jet (John 2002-2006) and 2 years are from Atlanta Elephant (John 2007-2008).

Finally, I can calculate "previous_exp" for Chicago Fire using the formula:(length of the experience 1/total years)^2+(length of the experience 2/total years)^2+......, which equals to (2/12)^2+(3/12)^2+(5/12)^2+(2/12)^2=0.292. I want to calculate this measure for all clubs. Below is a sample output (Noting that New York Jet is NA because no previous experience is linked to New York Jet):

Club               previous_exp
Chicago Fire       0.292
Boston Pant        1
New York Jet       NA
Atlanta Elephant   1


Solution 1:[1]

Because the length of the individual previous stays is important, and because these accumulate over time, we can make a helper function to keep track of these.

library(tidyverse)

df <- tribble(~Year,   ~Player,    ~Club,
              2005,   "Phelan",    "Chicago Fire",
              2007,   "Phelan",    "Boston Pant",
              2008,   "Phelan",    "Boston Pant",
              2010,   "Phelan",    "Chicago Fire",
              2002,   "John",      "New York Jet",
              2006,   "John",      "New York Jet",
              2007,   "John",      "Atlanta Elephant",
              2009,   "John",      "Chicago Fire") %>% 
    mutate(Year = as.integer(Year))

helper <- function(vec){
    n <- length(vec)
    output <- vector("list", n)
    for(i in 1:n){
        output[[i]] <- vec[1:i]
    }
    output
}


df %>% 
    group_by(Player) %>% 
    arrange(Year, .by_group = TRUE) %>% 
    filter(Club != lag(Club, default = "")) %>% # Must remove Phelan's second Boston Pant stay
    mutate(experience_this_row = Year - lag(Year)) %>% 
    na.omit() %>% 
    mutate(experience_list = helper(experience_this_row)) %>% 
    group_by(Club) %>% 
    summarize(previous_exp = unlist(list(experience_list))) %>% 
    summarize(previous_exp = sum( (previous_exp/sum(previous_exp))^2))

There might be a smoother way for the summarizing by Club. The above works for me, but I'm a bit mystified by the consecutive summarize calls...

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Michael Dewar