How to define categories in R when the string is variable?

I have a long list of gene names in col4 of my table, and I want to categorize them in a new column (col6) using R.

My table:

col1      col2     col3     col4          col5
scf001    74212    85524    Jockey-1_DAzt   -
scf002    35002    48200    Jockey-4B_DVi   +
scf0101   82177    82314    BEL-1_DVir-I    -
scf00273  63849    29746    BEL-2_DEl-I     +
scf002    71524    73526    Mariner-2_DVi   +
scf0101   1172     1372     Mariner-9_DAn   -
scf00273  1        4356     ULYSSES_LTR     +

My gene names are highly variable, and only part of each name can be used to define its category. I tried the code below, but it doesn't work: the error says that all arguments must have the same length.

Also, how can I put the new categories in the new column?

df[["col6"]]<-0
table(df$col6, c("Jockey"="category1","BEL"="category2","ULYSSESS"="category2","Mariner"="category3"))


Solution 1:[1]

You can use case_when from dplyr and str_detect from stringr.

library(dplyr)
library(stringr)

df2 <- df %>%
  mutate(col6 = case_when(
    str_detect(col4, "Jockey") ~ "category1",
    str_detect(col4, "BEL") ~ "category2",
    # ... one line per remaining pattern
    TRUE ~ ""))

The second argument to str_detect is treated as a regular expression by default, so it can be a plain substring such as "Jockey" or a fuller pattern.
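
As a small sketch of the regular-expression case, assuming stringr is installed: the vector genes below is a hypothetical stand-in for col4, built from a few names in the question's table. One alternation pattern can cover two name prefixes that belong to the same category, and fixed() can be used when a literal match is wanted instead.

```r
library(stringr)

# Hypothetical stand-in for col4, using names from the question's sample table
genes <- c("Jockey-1_DAzt", "BEL-1_DVir-I", "ULYSSES_LTR", "Mariner-2_DVi")

# "^" anchors the match at the start of the name; "|" means "BEL or ULYSSES"
is_cat2 <- str_detect(genes, "^(BEL|ULYSSES)")

# fixed() makes stringr compare the pattern literally, not as a regex
is_jockey <- str_detect(genes, fixed("Jockey"))

print(data.frame(genes, is_cat2, is_jockey))
```

Anchoring with "^" is optional here; it only guards against a prefix such as "BEL" appearing later in an unrelated name.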

The last line, TRUE ~ "", assigns a fallback value (here an empty string) to rows where none of the patterns match.
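
For reference, here is a self-contained sketch of the approach applied to the sample rows from the question. It rests on two assumptions: the prefix-to-category mapping is taken from the asker's own attempt (with "ULYSSESS" spelled "ULYSSES" so that it matches the data), and only col1 and col4 are reproduced to keep the example short.

```r
library(dplyr)
library(stringr)

# Sample rows copied from the question's table (only col1 and col4 shown)
df <- data.frame(
  col1 = c("scf001", "scf002", "scf0101", "scf00273",
           "scf002", "scf0101", "scf00273"),
  col4 = c("Jockey-1_DAzt", "Jockey-4B_DVi", "BEL-1_DVir-I", "BEL-2_DEl-I",
           "Mariner-2_DVi", "Mariner-9_DAn", "ULYSSES_LTR"),
  stringsAsFactors = FALSE
)

# Mapping assumed from the asker's attempt; conditions are checked in order
df2 <- df %>%
  mutate(col6 = case_when(
    str_detect(col4, "Jockey")  ~ "category1",
    str_detect(col4, "BEL")     ~ "category2",
    str_detect(col4, "ULYSSES") ~ "category2",
    str_detect(col4, "Mariner") ~ "category3",
    TRUE                        ~ ""   # fallback when no pattern matches
  ))

# Count how many rows fall into each category
print(table(df2$col6))
```

Because case_when returns the value for the first condition that is TRUE, more specific patterns should be listed before more general ones if any names could match both.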

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 yuk