'Select columns based on string match - dplyr::select
I have a data frame ("data") with lots and lots of columns. Some of the columns contain a certain string ("search_string").
How can I use dplyr::select() to give me a subset including only the columns that contain the string?
I tried:
# columns as boolean vector
select(data, grepl("search_string",colnames(data)))
# columns as vector of column names names
select(data, colnames(data)[grepl("search_string",colnames(data))])
Neither of them work.
I know that select() accepts numeric vectors as substitute for columns e.g.:
select(data,5,7,9:20)
But I don't know how to get a numeric vector of columns IDs from my grepl() expression.
Solution 1:[1]
You can try:
select(data, matches("search_string"))
It is more general than contains - you can use regex (e.g. "one_string|or_the_other").
For more examples, see: http://rpackages.ianhowson.com/cran/dplyr/man/select.html.
Solution 2:[2]
No need to use select just use [ instead
data[,grepl("search_string", colnames(data))]
Let's try with iris dataset
>iris[,grepl("Sepal", colnames(iris))]
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
Solution 3:[3]
Based on Piotr Migdals response I want to give an alternate solution enabling the possibility for a vector of strings:
myVectorOfStrings <- c("foo", "bar")
matchExpression <- paste(myVectorOfStrings, collapse = "|")
# [1] "foo|bar"
df %>% select(matches(matchExpression))
Making use of the regex OR operator (|)
ATTENTION: If you really have a plain vector of column names (and do not need the power of RegExpression), please see the comment below this answer (since it's the cleaner solution).
Solution 4:[4]
Alternatively using a DataFrame of 22 columns:
library(plyr) # for baseball dataset.
library(dplyr)
baseball %>% colnames() %>% length()
[1] 22
baseball %>% colnames()
[1] "id" "year" "stint" "team" "lg" "g" "ab" "r" "h" "X2b" "X3b" "hr" "rbi"
[14] "sb" "cs" "bb" "so" "ibb" "hbp" "sh" "sf" "gidp"
You can use starts_with("s") and ends_with("b"):
> baseball %>% select(starts_with("s")) %>% head(5)
stint sb so sh sf
4 1 6 1 NA NA # players.columns.str.startswith('p')
44 1 8 0 NA NA
68 1 2 0 NA NA
99 1 4 0 NA NA
102 1 3 0 NA NA
> baseball %>% select(ends_with("b")) %>% head(5)
ab X2b X3b sb bb ibb
4 120 11 3 6 2 NA
44 162 9 4 8 4 NA
68 89 3 1 2 2 NA
99 161 5 1 4 3 NA
102 128 3 7 3 1 NA
# contains("g") matches names that contain āgā.
> baseball %>% select(contains("g")) %>% head(5)
lg g gidp
4 25 NA
44 32 NA
68 19 NA
99 33 NA
102 29 NA
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Jilber Urbina |
| Solution 3 | |
| Solution 4 | rubengavidia0x |
