How to define categories in R when the string is variable?
I have a long list of gene names in col4 of my table, and I want to categorize them in a new column (col6) using R.
My table:
col1 col2 col3 col4 col5
scf001 74212 85524 Jockey-1_DAzt -
scf002 35002 48200 Jockey-4B_DVi +
scf0101 82177 82314 BEL-1_DVir-I -
scf00273 63849 29746 BEL-2_DEl-I +
scf002 71524 73526 Mariner-2_DVi +
scf0101 1172 1372 Mariner-9_DAn -
scf00273 1 4356 ULYSSES_LTR +
Here my gene names are highly variable, and only part of each name is definable.
I tried the code below, but it doesn't work; the error is that all arguments must have the same length.
Also, how can I put the new categories in the new column?
df[["col6"]]<-0
table(df$col6, c("Jockey"="category1","BEL"="category2","ULYSSESS"="category2","Mariner"="category3"))
Solution 1:[1]
You can use case_when from dplyr and str_detect from stringr.
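library(dplyr)    # mutate(), case_when(), %>%
library(stringr)  # str_detect()

df2 <- df %>%
  mutate(col6 = case_when(
    str_detect(col4, "Jockey") ~ "category1",
    str_detect(col4, "BEL") ~ "category2",
    # ... (remaining patterns follow the same form)
    TRUE ~ ""))  # fallback when no pattern matches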
The second argument in str_detect can be a string or a regular expression.
The last line, TRUE ~ "", assigns a fallback value (here an empty string) to any row that matches none of the patterns.
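As a minimal sketch of how this could look on the sample rows from the question, the snippet below rebuilds the table by hand and fills in the remaining categories. The mapping follows the asker's attempt (Jockey to category1, BEL and ULYSSES to category2, Mariner to category3). Two things here are assumptions rather than part of the original answer: the patterns are anchored with ^ so they only match at the start of the name, and ULYSSES is spelled as it appears in col4 rather than as in the attempted code.

```r
library(dplyr)
library(stringr)

# Sample rows copied from the question's table
df <- data.frame(
  col1 = c("scf001", "scf002", "scf0101", "scf00273",
           "scf002", "scf0101", "scf00273"),
  col2 = c(74212, 35002, 82177, 63849, 71524, 1172, 1),
  col3 = c(85524, 48200, 82314, 29746, 73526, 1372, 4356),
  col4 = c("Jockey-1_DAzt", "Jockey-4B_DVi", "BEL-1_DVir-I", "BEL-2_DEl-I",
           "Mariner-2_DVi", "Mariner-9_DAn", "ULYSSES_LTR"),
  col5 = c("-", "+", "-", "+", "+", "-", "+"),
  stringsAsFactors = FALSE
)

df2 <- df %>%
  mutate(col6 = case_when(
    # Assumption: the family name is always at the start of col4, hence "^"
    str_detect(col4, "^Jockey") ~ "category1",
    # A regular expression with alternation covers two families in one rule
    str_detect(col4, "^(BEL|ULYSSES)") ~ "category2",
    str_detect(col4, "^Mariner") ~ "category3",
    # Fallback for names that match none of the patterns
    TRUE ~ ""
  ))

print(df2[, c("col4", "col6")])
```

case_when evaluates the conditions in order and uses the first one that matches, so more specific patterns should come before more general ones if they can overlap.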
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | yuk |
