R dplyr: count number of shared/matching rows by group?
I have a data frame:
site <- c("a1","a1","a1","a1","a1","a1","b1","b1","b1","b1","b1","b1","c1","c1","c1","c1","c1","c1")
year <- c(2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020)
sampleID <- c("a1.2018.1","a1.2019.1","a1.2020.1","a1.2018.1","a1.2019.1","a1.2020.1","b1.2018.1","b1.2019.1","b1.2020.1","b1.2018.1","b1.2019.1","b1.2020.1",
"c1.2018.1","c1.2019.1","c1.2020.1","c1.2018.1","c1.2019.1","c1.2020.1")
method <- c("a","a","a","b","b","b","a","a","a","b","b","b","a","a","a","b","b","b")
genus <- c("g1","g2","g3","g1","g4","g5","g2","g3","g4","g1","g2","g3","g1","g4","g5","g2","g3","g4")
df <- data.frame(site,year, sampleID, method, genus)
   site year  sampleID method genus
1    a1 2018 a1.2018.1      a    g1
2    a1 2019 a1.2019.1      a    g2
3    a1 2020 a1.2020.1      a    g3
4    a1 2018 a1.2018.1      b    g1
5    a1 2019 a1.2019.1      b    g4
6    a1 2020 a1.2020.1      b    g5
7    b1 2018 b1.2018.1      a    g2
8    b1 2019 b1.2019.1      a    g3
9    b1 2020 b1.2020.1      a    g4
10   b1 2018 b1.2018.1      b    g1
11   b1 2019 b1.2019.1      b    g2
12   b1 2020 b1.2020.1      b    g3
13   c1 2018 c1.2018.1      a    g1
14   c1 2019 c1.2019.1      a    g4
15   c1 2020 c1.2020.1      a    g5
16   c1 2018 c1.2018.1      b    g2
17   c1 2019 c1.2019.1      b    g3
18   c1 2020 c1.2020.1      b    g4
I want to obtain a count of the number of shared genera (genus column) and the total richness (total number of genera detected by both methods) grouped by site, year, sampleID, and method. My end goal is to calculate the % matching genera by the groups listed above. My ideal output would look like this:
  site year  sampleID shared total
1   a1 2018 a1.2018.1      x     y
2   a1 2019 a1.2019.1      x     y
3   a1 2020 a1.2020.1      x     y
4   b1 2018 b1.2018.1      x     y
5   b1 2019 b1.2019.1      x     y
6   b1 2020 b1.2020.1      x     y
7   c1 2018 c1.2018.1      x     y
8   c1 2019 c1.2019.1      x     y
9   c1 2020 c1.2020.1      x     y
How would I do this (ideally using a dplyr pipeline)? For example, I created this data frame by combining two separate data frames, one for each method (a, b), like this:
test <- combined_df %>%
  group_by(siteID, year, sampleID, method, genus) %>%
  filter(!is.na(genus)) %>%
  summarise(count = n_distinct(genus))
This gave me just the unique genera by group, but how would I count the number of matching genera and the total richness?
Solution 1:[1]
Your test pipeline is really close. If I understand your question correctly, you can drop method from the grouping after summarising and then sum the genera.
output <- df %>%
  group_by(site, year, sampleID, method, genus) %>%
  filter(!is.na(genus)) %>%
  summarise(shared = n_distinct(genus)) %>%
  ungroup(method) %>%
  mutate(total = sum(shared))
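Note that because genus is one of the grouping columns in the pipeline above, n_distinct(genus) is 1 for every row, so total ends up counting method/genus rows per sample rather than distinct genera. The following is a minimal alternative sketch, not the answerer's method. It assumes that "shared" means genera detected by both methods within a sample, that "total" means distinct genera across both methods, and that the two methods are labelled "a" and "b" as in the example data. The names result and pct_shared are illustrative only.

```r
library(dplyr)

# Rebuild the example data from the question
site   <- rep(c("a1", "b1", "c1"), each = 6)
year   <- rep(c(2018, 2019, 2020), times = 6)
method <- rep(rep(c("a", "b"), each = 3), times = 3)
genus  <- c("g1", "g2", "g3", "g1", "g4", "g5",
            "g2", "g3", "g4", "g1", "g2", "g3",
            "g1", "g4", "g5", "g2", "g3", "g4")
sampleID <- paste(site, year, 1, sep = ".")
df <- data.frame(site, year, sampleID, method, genus)

result <- df %>%
  filter(!is.na(genus)) %>%
  # one row per sample: method is deliberately NOT a grouping column
  group_by(site, year, sampleID) %>%
  summarise(
    # genera found by both methods (assumes labels "a" and "b")
    shared = length(base::intersect(genus[method == "a"],
                                    genus[method == "b"])),
    # distinct genera found by either method
    total = n_distinct(genus),
    pct_shared = 100 * shared / total,
    .groups = "drop"
  )

print(result)
```

With the example data, only sample a1.2018.1 has a shared genus (g1, found by both methods); every other sample has two different genera and no overlap. If there were more than two methods, the definition of "shared" would need to be decided first (for example, genera found by all methods versus by at least two).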
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | DylanMG |
