'Group by ID and keep latest date
I have a dataset that looks like this
ID date
1 2020-05-01
1 2020-06-13
2 2021-01-02
2 2021-01-02
3 2022-01-07
3 2021-12-14
And so on. There are about 30 variables in my dataset. I want to group by ID, and rewrite df$date such that only the latest date remains for each ID:
ID date
1 2020-06-13
1 2020-06-13
2 2021-01-02
2 2021-01-02
3 2022-01-07
3 2022-01-07
Date is formatted as "%Y-%m-%d"
Solution 1:[1]
You can use an if_else statement to check for latest date, and replace any date that is not the latest.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = if_else(date == max(date), as.Date(date), as.Date(max(date))))
# A tibble: 6 × 2
# Groups: ID [3]
ID date
<int> <date>
1 1 2020-06-13
2 1 2020-06-13
3 2 2021-01-02
4 2 2021-01-02
5 3 2022-01-07
6 3 2022-01-07
Solution 2:[2]
We could use max(date) with mutate:
library(dplyr)
library(lubridate)
df %>%
mutate(date = ymd(date)) %>% # need only to transform to date class
group_by(ID) %>%
mutate(date = max(date))
ID date
<int> <date>
1 1 2020-06-13
2 1 2020-06-13
3 2 2021-01-02
4 2 2021-01-02
5 3 2022-01-07
6 3 2022-01-07
Solution 3:[3]
If just keeping the rows with the most recent date per group is your purpose, you could do without group_by.
df %>%
arrange(ID, desc(date)) %>%
distinct(ID, .keep_all = T)
ID date
1 1 2020-06-13
2 2 2021-01-02
3 3 2022-01-07
Solution 4:[4]
This is where the data.table syntax shines.
library(data.table)
setDT(mydata)[, date := max(date), by = ID]
# ID date
# 1: 1 2020-06-13
# 2: 1 2020-06-13
# 3: 2 2021-01-02
# 4: 2 2021-01-02
# 5: 3 2022-01-07
# 6: 3 2022-01-07
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | benson23 |
| Solution 2 | |
| Solution 3 | Lennyy |
| Solution 4 | Wimpel |
