'Drop every duplicate value including the first instance of that value in R [duplicate]
I am trying to filter a grouped data frame where:
- Every duplicate in one column ("data_val") including the first instance of the duplicate is dropped.
- The row values in the columns corresponding to the duplicate columns are deleted as well (even if they are not duplicates).
EDIT: I was originally grouping by "sources" AND "db_source" but have found that I shouldn't be.
So, given:
example_data <- data.frame(sources = rep(c("S1", "S2", "S3"), each = 8),
data_val = rep(c(1, 1, 2, 3, 4, 4, 5, 6), 3),
db_source = rep(c("DB1", "DB2"), 12))
sources data_val db_source
1 S1 1 DB1
2 S1 1 DB2
3 S1 2 DB1
4 S1 3 DB2
5 S1 4 DB1
6 S1 4 DB2
7 S1 5 DB1
8 S1 6 DB2
9 S2 1 DB1
10 S2 1 DB2
11 S2 2 DB1
12 S2 3 DB2
13 S2 4 DB1
14 S2 4 DB2
15 S2 5 DB1
16 S2 6 DB2
17 S3 1 DB1
18 S3 1 DB2
19 S3 2 DB1
20 S3 3 DB2
21 S3 4 DB1
22 S3 4 DB2
23 S3 5 DB1
24 S3 6 DB2
I want the following:
sources data_val db_source
3 S1 2 DB1
4 S1 3 DB2
7 S1 5 DB1
8 S1 6 DB2
11 S2 2 DB1
12 S2 3 DB2
15 S2 5 DB1
16 S2 6 DB2
19 S3 2 DB1
20 S3 3 DB2
23 S3 5 DB1
24 S3 6 DB2
I tried using functions like `duplicated()` and `distinct()` in my pipe, but they will return the following:
sources data_val db_source
1 S1 1 DB1
2 S1 2 DB1
3 S1 3 DB2
4 S1 4 DB1
5 S1 5 DB1
6 S1 6 DB2
7 S2 1 DB1
8 S2 2 DB1
9 S2 3 DB2
10 S2 4 DB1
11 S2 5 DB1
12 S2 6 DB2
13 S3 1 DB1
14 S3 2 DB1
15 S3 3 DB2
16 S3 4 DB1
17 S3 5 DB1
18 S3 6 DB2
I understand the listed functions return the above because they check the vectors sequentially, but if there is a way to drop the first instance that would be great.
Thank you to anyone who can help.
Solution 1:[1]
dplyr
library(dplyr)
example_data %>%
group_by(sources, data_val) %>%
filter(n() < 2) %>%
ungroup()
# # A tibble: 12 x 3
# sources data_val db_source
# <chr> <dbl> <chr>
# 1 S1 2 DB1
# 2 S1 3 DB2
# 3 S1 5 DB1
# 4 S1 6 DB2
# 5 S2 2 DB1
# 6 S2 3 DB2
# 7 S2 5 DB1
# 8 S2 6 DB2
# 9 S3 2 DB1
# 10 S3 3 DB2
# 11 S3 5 DB1
# 12 S3 6 DB2
base R
Either one works:
example_data[ave(example_data[ave(seq_len(nrow(example_data)), example_data[c("sources","data_val")], FUN=length) < 2,]
# sources data_val db_source
# 3 S1 2 DB1
# 4 S1 3 DB2
# 7 S1 5 DB1
# 8 S1 6 DB2
# 11 S2 2 DB1
# 12 S2 3 DB2
# 15 S2 5 DB1
# 16 S2 6 DB2
# 19 S3 2 DB1
# 20 S3 3 DB2
# 23 S3 5 DB1
# 24 S3 6 DB2
Solution 2:[2]
Another optione would be using lag function:
library(dplyr)
example_data %>%
group_by(sources) %>%
filter(data_val != lag(data_val)) %>%
ungroup()
sources data_val db_source
<chr> <dbl> <chr>
1 S1 2 DB1
2 S1 3 DB2
3 S2 2 DB1
4 S2 3 DB2
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | TarJae |
