'Drop every duplicate value including the first instance of that value in R [duplicate]

I am trying to filter a grouped data frame where:

  • Every duplicate in one column ("data_val") including the first instance of the duplicate is dropped.
  • The row values in the columns corresponding to the duplicate columns are deleted as well (even if they are not duplicates).

EDIT: I was originally grouping by "sources" AND "db_source" but have found that I shouldn't be.

So, given:

example_data <- data.frame(sources = rep(c("S1", "S2", "S3"), each = 8),
                           data_val = rep(c(1, 1, 2, 3, 4, 4, 5, 6), 3),
                           db_source = rep(c("DB1", "DB2"), 12))

       sources data_val db_source
1       S1        1       DB1
2       S1        1       DB2
3       S1        2       DB1
4       S1        3       DB2
5       S1        4       DB1
6       S1        4       DB2
7       S1        5       DB1
8       S1        6       DB2
9       S2        1       DB1
10      S2        1       DB2
11      S2        2       DB1
12      S2        3       DB2
13      S2        4       DB1
14      S2        4       DB2
15      S2        5       DB1
16      S2        6       DB2
17      S3        1       DB1
18      S3        1       DB2
19      S3        2       DB1
20      S3        3       DB2
21      S3        4       DB1
22      S3        4       DB2
23      S3        5       DB1
24      S3        6       DB2

I want the following:

       sources data_val db_source
3       S1        2       DB1
4       S1        3       DB2
7       S1        5       DB1
8       S1        6       DB2
11      S2        2       DB1
12      S2        3       DB2
15      S2        5       DB1
16      S2        6       DB2
19      S3        2       DB1
20      S3        3       DB2
23      S3        5       DB1
24      S3        6       DB2

I tried using functions like `duplicated()` and `distinct()` in my pipe, but they will return the following:

       sources data_val db_source   
 1      S1        1       DB1      
 2      S1        2       DB1      
 3      S1        3       DB2      
 4      S1        4       DB1      
 5      S1        5       DB1      
 6      S1        6       DB2      
 7      S2        1       DB1      
 8      S2        2       DB1      
 9      S2        3       DB2      
10      S2        4       DB1      
11      S2        5       DB1      
12      S2        6       DB2      
13      S3        1       DB1      
14      S3        2       DB1      
15      S3        3       DB2      
16      S3        4       DB1      
17      S3        5       DB1      
18      S3        6       DB2

I understand the listed functions return the above because they check the vectors sequentially, but if there is a way to drop the first instance that would be great.

Thank you to anyone who can help.



Solution 1:[1]

dplyr

library(dplyr)
example_data %>%
  group_by(sources, data_val) %>%
  filter(n() < 2) %>%
  ungroup()
# # A tibble: 12 x 3
#    sources data_val db_source
#    <chr>      <dbl> <chr>    
#  1 S1             2 DB1      
#  2 S1             3 DB2      
#  3 S1             5 DB1      
#  4 S1             6 DB2      
#  5 S2             2 DB1      
#  6 S2             3 DB2      
#  7 S2             5 DB1      
#  8 S2             6 DB2      
#  9 S3             2 DB1      
# 10 S3             3 DB2      
# 11 S3             5 DB1      
# 12 S3             6 DB2      

base R

Either one works:

example_data[ave(example_data[ave(seq_len(nrow(example_data)), example_data[c("sources","data_val")], FUN=length) < 2,]
#    sources data_val db_source
# 3       S1        2       DB1
# 4       S1        3       DB2
# 7       S1        5       DB1
# 8       S1        6       DB2
# 11      S2        2       DB1
# 12      S2        3       DB2
# 15      S2        5       DB1
# 16      S2        6       DB2
# 19      S3        2       DB1
# 20      S3        3       DB2
# 23      S3        5       DB1
# 24      S3        6       DB2

Solution 2:[2]

Another optione would be using lag function:

library(dplyr)

example_data %>%
  group_by(sources) %>% 
  filter(data_val != lag(data_val)) %>% 
  ungroup()


  sources data_val db_source
  <chr>      <dbl> <chr>    
1 S1             2 DB1      
2 S1             3 DB2      
3 S2             2 DB1      
4 S2             3 DB2  

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 TarJae