'R group by show count of all factor levels even when zero dplyr
set.seed(1)
dat <- data.frame(ID = sample(letters,50,rep=TRUE))
dat %>%
group_by(ID) %>%
summarise(no_rows = length(ID))
I have the above code which creates a random sample of letters. However can I make the summarised output show all count levels even when there is zero.
When I run the above code sometimes I get 20 rows and sometimes I get 25, etc. I want this to return 26 rows every time.
Solution 1:[1]
In the accepted answer by akrun, table() works, but the tidyverse answer gives inaccurate counts (see below). Instead use the .drop = FALSE option:
library(tidyverse)
set.seed(1)
dat <- data.frame(ID = sample(letters,50,rep=TRUE))
dat %>%
mutate(ID = factor(ID, levels = letters)) %>%
count(ID, name = "no_rows", .drop = F) %>%
print.data.frame()
#> ID no_rows
#> 1 a 3
#> 2 b 2
#> 3 c 1
#> 4 d 1
#> 5 e 3
#> 6 f 3
#> 7 g 2
#> 8 h 1
#> 9 i 2
#> 10 j 5
#> 11 k 1
#> 12 l 3
#> 13 m 0
#> 14 n 3
#> 15 o 3
#> 16 p 0
#> 17 q 0
#> 18 r 1
#> 19 s 1
#> 20 t 3
#> 21 u 3
#> 22 v 1
#> 23 w 2
#> 24 x 0
#> 25 y 5
#> 26 z 1
Created on 2019-11-22 by the reprex package (v0.3.0)
Note that we expect nonzero counts for all letters but m, p, q, and x:
set.seed(1)
dat <- data.frame(ID = sample(letters,50,rep=TRUE))
levels(dat$ID)
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "n" "o" "r" "s" "t"
#> [18] "u" "v" "w" "y" "z"
But if we use complete() we get ones instead:
set.seed(1)
dat <- data.frame(ID = sample(letters,50,rep=TRUE))
dat %>%
mutate(ID=factor(ID, levels = letters)) %>%
complete(ID) %>%
group_by(ID) %>%
summarise(no_rows = n()) %>%
print.data.frame()
#> ID no_rows
# ...
#> 12 l 3
#> 13 m 1 # should be 0
#> 14 n 3
#> 15 o 3
#> 16 p 1 # should be 0
#> 17 q 1 # should be 0
#> 18 r 1
#> 19 s 1
#> 20 t 3
#> 21 u 3
#> 22 v 1
#> 23 w 2
#> 24 x 1 # should be 0
#> 25 y 5
#> 26 z 1
That's because complete() actually adds a single m, p, q, and x to ID so it contains at least one of each letter.
Solution 2:[2]
This will also work:
set.seed(1)
dat <- data.frame(ID = sample(letters,50,rep=TRUE))
levels(dat$ID) <- letters
as.data.frame(dat %>% xtabs(formula = ~ ID))
# ID Freq
#1 a 2
#2 b 1
#3 c 1
#4 d 1
#5 e 2
#6 f 3
#7 g 2
#8 h 1
#9 i 4
#10 j 2
#11 k 4
#12 l 1
#13 m 2
#14 n 1
#15 o 3
#16 p 3
#17 q 3
#18 r 1
#19 s 5
#20 t 2
#21 u 1
#22 v 2
#23 w 2
#24 x 1
#25 y 0
#26 z 0
Solution 3:[3]
Using the accepted solution by akrun I get a wrong result. I expect a frequency table with all letters with no_rows = 0 except "a" with no_rows = 1.
library(tidyverse)
set.seed(1)
dat <- data.frame(ID = "a")
dat %>%
dplyr::mutate(ID=factor(ID, levels = letters)) %>%
tidyr::complete(ID) %>%
dplyr::group_by(ID) %>%
dplyr::summarise(no_rows = n())
Instead I get a frequency table with no_rows = 1 for all letters:
ID no_rows
<fct> <int>
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 f 1
7 g 1
8 h 1
9 i 1
10 j 1
# ... with 16 more rows
Solution 4:[4]
The tidyverse solution by akrun can be corrected as follows :
set.seed(1)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.5
dat <- data.frame(ID = sample(letters,50,rep=TRUE))
dat %>%
mutate(ID=factor(ID, levels = letters)) %>%
group_by(ID) %>%
summarise(no_rows = n()) %>%
complete(ID, fill = list(no_rows = 0)) %>%
print.data.frame()
#> `summarise()` ungrouping output (override with `.groups` argument)
#> ID no_rows
#> 1 a 3
#> 2 b 2
#> 3 c 1
#> 4 d 1
#> 5 e 3
#> 6 f 3
#> 7 g 2
#> 8 h 1
#> 9 i 2
#> 10 j 5
#> 11 k 1
#> 12 l 3
#> 13 m 0
#> 14 n 3
#> 15 o 3
#> 16 p 0
#> 17 q 0
#> 18 r 1
#> 19 s 1
#> 20 t 3
#> 21 u 3
#> 22 v 1
#> 23 w 2
#> 24 x 0
#> 25 y 5
#> 26 z 1
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nick |
| Solution 2 | Sandipan Dey |
| Solution 3 | |
| Solution 4 | slowowl |
