'Drop every duplicate value including the first instance of that value in R [duplicate]

I am trying to filter a grouped data frame where:

Every duplicate in one column ("data_val") including the first instance of the duplicate is dropped.
The row values in the columns corresponding to the duplicate columns are deleted as well (even if they are not duplicates).

EDIT: I was originally grouping by "sources" AND "db_source" but have found that I shouldn't be.

So, given:

example_data <- data.frame(sources = rep(c("S1", "S2", "S3"), each = 8),
                           data_val = rep(c(1, 1, 2, 3, 4, 4, 5, 6), 3),
                           db_source = rep(c("DB1", "DB2"), 12))

       sources data_val db_source
1       S1        1       DB1
2       S1        1       DB2
3       S1        2       DB1
4       S1        3       DB2
5       S1        4       DB1
6       S1        4       DB2
7       S1        5       DB1
8       S1        6       DB2
9       S2        1       DB1
10      S2        1       DB2
11      S2        2       DB1
12      S2        3       DB2
13      S2        4       DB1
14      S2        4       DB2
15      S2        5       DB1
16      S2        6       DB2
17      S3        1       DB1
18      S3        1       DB2
19      S3        2       DB1
20      S3        3       DB2
21      S3        4       DB1
22      S3        4       DB2
23      S3        5       DB1
24      S3        6       DB2

I want the following:

       sources data_val db_source
3       S1        2       DB1
4       S1        3       DB2
7       S1        5       DB1
8       S1        6       DB2
11      S2        2       DB1
12      S2        3       DB2
15      S2        5       DB1
16      S2        6       DB2
19      S3        2       DB1
20      S3        3       DB2
23      S3        5       DB1
24      S3        6       DB2

I tried using functions like `duplicated()` and `distinct()` in my pipe, but they will return the following:

       sources data_val db_source   
 1      S1        1       DB1      
 2      S1        2       DB1      
 3      S1        3       DB2      
 4      S1        4       DB1      
 5      S1        5       DB1      
 6      S1        6       DB2      
 7      S2        1       DB1      
 8      S2        2       DB1      
 9      S2        3       DB2      
10      S2        4       DB1      
11      S2        5       DB1      
12      S2        6       DB2      
13      S3        1       DB1      
14      S3        2       DB1      
15      S3        3       DB2      
16      S3        4       DB1      
17      S3        5       DB1      
18      S3        6       DB2

I understand the listed functions return the above because they check the vectors sequentially, but if there is a way to drop the first instance that would be great.

Thank you to anyone who can help.

r dplyr

Solution 1:^[1]

dplyr

library(dplyr)
example_data %>%
  group_by(sources, data_val) %>%
  filter(n() < 2) %>%
  ungroup()
# # A tibble: 12 x 3
#    sources data_val db_source
#    <chr>      <dbl> <chr>    
#  1 S1             2 DB1      
#  2 S1             3 DB2      
#  3 S1             5 DB1      
#  4 S1             6 DB2      
#  5 S2             2 DB1      
#  6 S2             3 DB2      
#  7 S2             5 DB1      
#  8 S2             6 DB2      
#  9 S3             2 DB1      
# 10 S3             3 DB2      
# 11 S3             5 DB1      
# 12 S3             6 DB2

base R

Either one works:

example_data[ave(example_data[ave(seq_len(nrow(example_data)), example_data[c("sources","data_val")], FUN=length) < 2,]
#    sources data_val db_source
# 3       S1        2       DB1
# 4       S1        3       DB2
# 7       S1        5       DB1
# 8       S1        6       DB2
# 11      S2        2       DB1
# 12      S2        3       DB2
# 15      S2        5       DB1
# 16      S2        6       DB2
# 19      S3        2       DB1
# 20      S3        3       DB2
# 23      S3        5       DB1
# 24      S3        6       DB2

Solution 2:^[2]

Another optione would be using lag function:

library(dplyr)

example_data %>%
  group_by(sources) %>% 
  filter(data_val != lag(data_val)) %>% 
  ungroup()


  sources data_val db_source
  <chr>      <dbl> <chr>    
1 S1             2 DB1      
2 S1             3 DB2      
3 S2             2 DB1      
4 S2             3 DB2

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	TarJae

'Drop every duplicate value including the first instance of that value in R [duplicate]

Solution 1:[1]

dplyr

base R

Solution 2:[2]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]