'Dplyr filter behaviour when using vector [duplicate]
I use dplyr quite a lot for data wrangling, but I never figured out dplyr filter behaviour when using filter(df, variable == c(value1, value2)
Lets use iris data set as an example.
library(dplyr)
data(iris)
# I want to filter by Species 'setosa' and 'versicolor'
# Solution 1
filter1 <- filter(iris, Species == 'setosa' | Species == 'versicolor')
nrow(filter1)
[1] 100 # expected result
# Solution 2
filter2 <- filter(iris, Species %in% c('setosa', 'versicolor'))
nrow(filter2)
[1] 100 # expected result
filter1 == filter2 # both solutions return the exact same result
#Solution 3
filter3 <- filter(iris, Species == c('setosa', 'versicolor'))
nrow(filter3)
[1] 50 # unexpected result
unique(filter3$Species)
[1] setosa versicolor
Levels: setosa versicolor virginica
Although Solution 3 is filtering for the intended species, as shown by unique(filter3$Species), it only returns half of the occurrences (50 compared to 100 in Solution 1and Solution2). I would appreciate some guidance on what is actually going on in Solution 3.
Solution 1:[1]
filter(iris, Species == c("versicolor", "setosa")) does not make sense in an intuitive way, because one Species is not a 2-tuple:
> "setosa" == c("setosa", "versicolor")
[1] TRUE FALSE
Interestingly, filter(iris, Species == c("setosa", "versicolor")) produce the same results: The first Species of the data frame will be returned, so descending sorting will give you versicolor:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>%
as_tibble()
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
iris %>%
filter(Species == c('setosa', 'versicolor')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 4.9 3 1.4 0.2 setosa
#> 2 4.6 3.1 1.5 0.2 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5 3.4 1.5 0.2 setosa
#> 5 4.9 3.1 1.5 0.1 setosa
#> 6 4.8 3.4 1.6 0.2 setosa
#> 7 4.3 3 1.1 0.1 setosa
#> 8 5.7 4.4 1.5 0.4 setosa
#> 9 5.1 3.5 1.4 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
#> # … with 40 more rows
iris %>%
arrange(Species) %>%
filter(Species == c('versicolor', 'setosa')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 4.9 3 1.4 0.2 setosa
#> 2 4.6 3.1 1.5 0.2 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5 3.4 1.5 0.2 setosa
#> 5 4.9 3.1 1.5 0.1 setosa
#> 6 4.8 3.4 1.6 0.2 setosa
#> 7 4.3 3 1.1 0.1 setosa
#> 8 5.7 4.4 1.5 0.4 setosa
#> 9 5.1 3.5 1.4 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
#> # … with 40 more rows
iris %>%
arrange(desc(Species)) %>%
filter(Species == c('setosa', 'versicolor')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.4 3.2 4.5 1.5 versicolor
#> 2 5.5 2.3 4 1.3 versicolor
#> 3 5.7 2.8 4.5 1.3 versicolor
#> 4 4.9 2.4 3.3 1 versicolor
#> 5 5.2 2.7 3.9 1.4 versicolor
#> 6 5.9 3 4.2 1.5 versicolor
#> 7 6.1 2.9 4.7 1.4 versicolor
#> 8 6.7 3.1 4.4 1.4 versicolor
#> 9 5.8 2.7 4.1 1 versicolor
#> 10 5.6 2.5 3.9 1.1 versicolor
#> # … with 40 more rows
iris %>%
arrange(desc(Species)) %>%
filter(Species == c('versicolor', 'setosa')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.9 3.1 4.9 1.5 versicolor
#> 3 6.5 2.8 4.6 1.5 versicolor
#> 4 6.3 3.3 4.7 1.6 versicolor
#> 5 6.6 2.9 4.6 1.3 versicolor
#> 6 5 2 3.5 1 versicolor
#> 7 6 2.2 4 1 versicolor
#> 8 5.6 2.9 3.6 1.3 versicolor
#> 9 5.6 3 4.5 1.5 versicolor
#> 10 6.2 2.2 4.5 1.5 versicolor
#> # … with 40 more rows
Created on 2022-02-11 by the reprex package (v2.0.0)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | danlooo |
