'Multirows pair-wised comparison of identifiers

After having bound two larges dataframes for the purpose of record linkage, I want to check the identifiers matching process. My first data output looks like this:

id1 <- c("a1", "a1", "b1", "b1", "c1", "c1", "c1", "d1", "d1", "e1", "e1", "e1", "e1", "f1", "f1")
id2 <- c("123", "123", "456", "695", "252", "252", "252", "98521", "98521", "895", "895", "9856", "895", "9853", "9853")
df1 <- data.frame (id1, id2)

> df1
   id1   id2
1   a1   123
2   a1   123
3   b1   456
4   b1   695
5   c1   252
6   c1   252
7   c1   252
8   d1 98521
9   d1 98521
10  e1   895
11  e1   895
12  e1  9856
13  e1   895
14  f1  9853
15  f1  9853

I want to extract in a new df both identifiers whose match is not consistent such displayed below.

id1 <- c("b1", "b1", "e1", "e1", "e1", "e1")
id2 <- c("456", "695", "895", "895", "9856", "895")
df2 <- data.frame (id1, id2)

> df2
  id1  id2
1  b1  456
2  b1  695
3  e1  895
4  e1  895
5  e1 9856
6  e1  895

I can't figure out how to do this as the number of "pairs" is not constant and the unmatch is at a random position.

An output like this is also an acceptable option:

id1 <- c("a1", "a1", "b1", "b1", "c1", "c1", "c1", "d1", "d1", "e1", "e1", "e1", "e1", "f1", "f1")
id2 <- c("123", "123", "456", "695", "252", "252", "252", "98521", "98521", "895", "895", "9856", "895", "9853", "9853")
concord <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
df3 <- data.frame (id1, id2, concord)

> df3
   id1   id2 concord
1   a1   123    TRUE
2   a1   123    TRUE
3   b1   456   FALSE
4   b1   695   FALSE
5   c1   252    TRUE
6   c1   252    TRUE
7   c1   252    TRUE
8   d1 98521    TRUE
9   d1 98521    TRUE
10  e1   895   FALSE
11  e1   895   FALSE
12  e1  9856   FALSE
13  e1   895   FALSE
14  f1  9853    TRUE
15  f1  9853    TRUE

Thank you in advance for your help.



Solution 1:[1]

base R

df1$concord <- ave(df1$id2, df1$id1, FUN = function(x) length(unique(x))) == 1

dplyr

You can use dplyr::n_distinct, a wrapper for length(unique(x)):

df1 %>% 
  group_by(id1) %>% 
  mutate(concord = n_distinct(id2) == 1)

output

   id1   id2   concord
 1 a1    123   TRUE   
 2 a1    123   TRUE   
 3 b1    456   FALSE  
 4 b1    695   FALSE  
 5 c1    252   TRUE   
 6 c1    252   TRUE   
 7 c1    252   TRUE   
 8 d1    98521 TRUE   
 9 d1    98521 TRUE   
10 e1    895   FALSE  
11 e1    895   FALSE  
12 e1    9856  FALSE  
13 e1    895   FALSE  
14 f1    9853  TRUE   
15 f1    9853  TRUE

Solution 2:[2]

Base R approach

This can be achieved by splitting df1 by id1 and then check for each id1 consists of >1 unique case per id1. Because split returns a list, we can exploit lapply to perform the task on each element of that list. As the output of lapply is a list as well, we can use do.call to perform a task to the entire list at once (here: combine by rows, rbind).

out <- do.call(
  rbind,
  lapply(split(df1, df1$id1), function(x) cbind(x, concord = nrow(unique(x)) < 2))
)
rownames(out) <- NULL # optional

Output

> head(out)
  id1 id2 concord
1  a1 123    TRUE
2  a1 123    TRUE
3  b1 456   FALSE
4  b1 695   FALSE
5  c1 252    TRUE
6  c1 252    TRUE

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2