'R: Remove duplicates row based on certain criteria
I want to remove duplicate based on certain criteria. My data look like:
Animal<-c("bird","Bird ","Dog","Cat F","Lion","Lion","Lion","dog","Horse","cat", "Lion")
A_date<-c("02-08-2020","20-06-2018","01-01-2015","10-07-2021","20-06-2018","15-08-2019","05-08-2013","20-06-2010","15-11-2016","22-03-2022","15-05-2019")
ID<-c("T1", "T1","T1","T2","T2","T3","T3","T4","T4","T5","T5")
Mydata<-data.frame(Animal, A_date,col_1)
Animal A_date ID
bird 02-08-2020 T1
Bird 20-06-2018 T1
Dog 01-01-2015 T1
Cat F 10-07-2021 T2
Lion 20-06-2018 T2
Lion 15-08-2019 T3
lion 05-08-2013 T3
dog 20-06-2010 T4
Horse 15-11-2016 T4
cat 22-03-2022 T5
Lion 15-05-2019 T5
I want to remove duplicate row so that only the row with latest date pr. ID will remain. For example in the above table Lion appears 3 times with the same ID. So I want to remain with only Lion 15-08-2019 T3 but I want to keep Lion that has ID T5.
The end result should be like the following:
Animal A_date ID
Dog 01-01-2015 T1
bird 02-08-2020 T1
Dog 01-01-2015 T1
Cat F 10-07-2021 T2
Lion 15-08-2019 T3
dog 20-06-2010 T4
Horse 15-11-2016 T4
cat 22-03-2022 T5
Lion 15-05-2019 T5
The data that I working on is very large and has ID from T1 to T20. I have so fare tried the following code. But it does not work properly
library(lubridate)
library(dplyr)
Mydata <- Mydata %>%
mutate(Animal = toupper(Animal), A_date = lubridate::dmy(A_date)) %>%
arrange(A_date)
Mydata %>%
filter(!duplicated(Animal, fromLast = TRUE))
the result that I get
Animal A_date ID
DOG <NA> T1
HORSE <NA> T4
BIRD <NA> T1
LION <NA> T3
BIRD <NA> T1
CAT F <NA> T2
CAT <NA> T5
This is not the end result that i want.
Solution 1:[1]
One option is to group by ID and Animal, then arrange so that for each group the most recent date is at the top for that group (i.e., the latest date), then slice that row.
library(lubridate)
library(dplyr)
Mydata %>%
mutate(Animal = trimws(toupper(Animal)), A_date = lubridate::dmy(A_date)) %>%
group_by(ID, Animal) %>%
arrange(ID, Animal, desc(A_date)) %>%
slice(1)
Output
Animal A_date ID
<chr> <date> <chr>
1 BIRD 2020-08-02 T1
2 DOG 2015-01-01 T1
3 CAT F 2021-07-10 T2
4 LION 2018-06-20 T2
5 LION 2019-08-15 T3
6 DOG 2010-06-20 T4
7 HORSE 2016-11-15 T4
8 CAT 2022-03-22 T5
9 LION 2019-05-15 T5
Solution 2:[2]
We can try slice_max over A_date
Mydata %>%
mutate(Animal = toupper(Animal), A_date = lubridate::dmy(A_date)) %>%
group_by(ID, Animal) %>%
slice_max(A_date) %>%
ungroup()
which gives
# A tibble: 10 x 3
Animal A_date ID
<chr> <date> <chr>
1 "BIRD" 2020-08-02 T1
2 "BIRD " 2018-06-20 T1
3 "DOG" 2015-01-01 T1
4 "CAT F" 2021-07-10 T2
5 "LION" 2018-06-20 T2
6 "LION" 2019-08-15 T3
7 "DOG" 2010-06-20 T4
8 "HORSE" 2016-11-15 T4
9 "CAT" 2022-03-22 T5
10 "LION" 2019-05-15 T5
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | ThomasIsCoding |
