'How to create a first period of treatment variable (for the "did" package)
I am currently trying to run a DiD experiment design following the recent Callaway and Santanna estimator and "did" package in R. In order to do so, I need to obatin a variable that specifies the first year that the subject received the treatment and to get value 0 if the subject has never received the treatment.
Let's say I have this dataset:
subject <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C")
year <- c(2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004)
treat <- c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
df1 <- data.frame(subject, year, treat)
I want to obtain this:
subject <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C")
year <- c(2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004)
treat <- c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
first_treat <- c(2003, 2003, 2003, 2003, 2003, 2001, 2001, 2001, 2001, 2001, 0, 0, 0, 0, 0)
df2 <- data.frame(subject, year, treat, first_treat)
In my original dataset I have mulriple subjects, so I would like to obtain a code to get this done without the need to mention rows or column values.
I have already tried this:
df1 %>%
group_by(subject) %>%
mutate(first_treat = coalesce(year[treat == 1][1], 0)) %>%
ungroup
But the new variable first_treat appears to only capture the first year of treatment of the first subject, in this case it would be c(2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003).
I have also tried this:
df1 %>%
group_by(subject) %>%
arrange(year, .by_group = T) %>%
mutate(first_treat=case_when(
treat==1 & lag(treat==0) ~ year,
TRUE ~ 0
)) %>%
mutate(first_treat=max(first_treat))
But I obtain the following error: "Error in order(year, .by_group = T) : argument lengths differ"
Do you know an alternative way to get the variable first_treat that I want, or perhaps how to solve the problems with the code i provided?
Thanks!
Solution 1:[1]
Here are two ways:
The first is to create a separate data frame that filters to those given years for each subject and then rejoin it back to the original data frame. Note that for subject C the new variable will return NA which I think is more appropriate than 0.
library(dplyr)
df_to_join <- df1 |>
group_by(subject) |>
filter(treat == 1) |>
filter(year == min(year)) |>
select(subject, first_treat = year)
left_join(df1, df_to_join)
subject year treat first_treat
1 A 2000 0 2003
2 A 2001 0 2003
3 A 2002 0 2003
4 A 2003 1 2003
5 A 2004 1 2003
6 B 2000 0 2001
7 B 2001 1 2001
8 B 2002 1 2001
9 B 2003 1 2001
10 B 2004 1 2001
11 C 2000 0 NA
12 C 2001 0 NA
13 C 2002 0 NA
14 C 2003 0 NA
15 C 2004 0 NA
The second is to create a new variable that satisfies the criteria for that particular row where it occurs, and then use tidyr::fill() to fill in the blanks for each subject:
library(tidyr)
df1 |>
group_by(subject, treat) |>
mutate(first_treat = if_else(treat == 1 & year == min(year), year, NA_real_)) |>
group_by(subject) |>
fill(first_treat, .direction = "downup")
Results would be the same either way.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Phil |
