'Multirows pair-wised comparison of identifiers
After having bound two larges dataframes for the purpose of record linkage, I want to check the identifiers matching process. My first data output looks like this:
id1 <- c("a1", "a1", "b1", "b1", "c1", "c1", "c1", "d1", "d1", "e1", "e1", "e1", "e1", "f1", "f1")
id2 <- c("123", "123", "456", "695", "252", "252", "252", "98521", "98521", "895", "895", "9856", "895", "9853", "9853")
df1 <- data.frame (id1, id2)
> df1
id1 id2
1 a1 123
2 a1 123
3 b1 456
4 b1 695
5 c1 252
6 c1 252
7 c1 252
8 d1 98521
9 d1 98521
10 e1 895
11 e1 895
12 e1 9856
13 e1 895
14 f1 9853
15 f1 9853
I want to extract in a new df both identifiers whose match is not consistent such displayed below.
id1 <- c("b1", "b1", "e1", "e1", "e1", "e1")
id2 <- c("456", "695", "895", "895", "9856", "895")
df2 <- data.frame (id1, id2)
> df2
id1 id2
1 b1 456
2 b1 695
3 e1 895
4 e1 895
5 e1 9856
6 e1 895
I can't figure out how to do this as the number of "pairs" is not constant and the unmatch is at a random position.
An output like this is also an acceptable option:
id1 <- c("a1", "a1", "b1", "b1", "c1", "c1", "c1", "d1", "d1", "e1", "e1", "e1", "e1", "f1", "f1")
id2 <- c("123", "123", "456", "695", "252", "252", "252", "98521", "98521", "895", "895", "9856", "895", "9853", "9853")
concord <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
df3 <- data.frame (id1, id2, concord)
> df3
id1 id2 concord
1 a1 123 TRUE
2 a1 123 TRUE
3 b1 456 FALSE
4 b1 695 FALSE
5 c1 252 TRUE
6 c1 252 TRUE
7 c1 252 TRUE
8 d1 98521 TRUE
9 d1 98521 TRUE
10 e1 895 FALSE
11 e1 895 FALSE
12 e1 9856 FALSE
13 e1 895 FALSE
14 f1 9853 TRUE
15 f1 9853 TRUE
Thank you in advance for your help.
Solution 1:[1]
base R
df1$concord <- ave(df1$id2, df1$id1, FUN = function(x) length(unique(x))) == 1
dplyr
You can use dplyr::n_distinct, a wrapper for length(unique(x)):
df1 %>%
group_by(id1) %>%
mutate(concord = n_distinct(id2) == 1)
output
id1 id2 concord
1 a1 123 TRUE
2 a1 123 TRUE
3 b1 456 FALSE
4 b1 695 FALSE
5 c1 252 TRUE
6 c1 252 TRUE
7 c1 252 TRUE
8 d1 98521 TRUE
9 d1 98521 TRUE
10 e1 895 FALSE
11 e1 895 FALSE
12 e1 9856 FALSE
13 e1 895 FALSE
14 f1 9853 TRUE
15 f1 9853 TRUE
Solution 2:[2]
Base R approach
This can be achieved by splitting df1 by id1 and then check for each id1 consists of >1 unique case per id1. Because split returns a list, we can exploit lapply to perform the task on each element of that list. As the output of lapply is a list as well, we can use do.call to perform a task to the entire list at once (here: combine by rows, rbind).
out <- do.call(
rbind,
lapply(split(df1, df1$id1), function(x) cbind(x, concord = nrow(unique(x)) < 2))
)
rownames(out) <- NULL # optional
Output
> head(out)
id1 id2 concord
1 a1 123 TRUE
2 a1 123 TRUE
3 b1 456 FALSE
4 b1 695 FALSE
5 c1 252 TRUE
6 c1 252 TRUE
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 |
