Remove part of the variable names in an R column
I want to clean up a column in an R data frame so that it contains only the species names. That is, I would like to remove everything from the 2nd "_" onward.
This is my table:
| col1 | Col2 |
|---|---|
| Pelagodinium_beii_RCC1491_SRR1300503_MMETSP1338c20 | 4 |
| Acanthoeca_10tr_SRR1294413_MMETSP0105_2c10003_g1_i1 | 5 |
| Rhodosorus_marinus_UTEX-LB-2760_SRR1296985_MMETSP | 5 |
| Vannella_sp._CB-2014_DIVA3-518-3-11-1-6_SRR1296762_M | 3 |
| Florenciella_parvula_CCMP2471_SRR1294437_MMETSP134 | 5 |
I would like to have:
| col1 | Col2 |
|---|---|
| Pelagodinium_beii | 4 |
| Acanthoeca_10tr | 5 |
| Rhodosorus_marinus | 5 |
| Vannella_sp. | 3 |
| Florenciella_parvula | 5 |
I'm not really used to R and I didn't find the right method.
Solution 1:[1]
df$col1 <- sub("^([^_]+_[^_]+)_.*", "\\1", df$col1, perl = TRUE)
df
col1 Col2
1 Pelagodinium_beii 4
2 Acanthoeca_10tr 5
3 Rhodosorus_marinus 5
4 Vannella_sp. 3
5 Florenciella_parvula 5
With df as follows:
df <- read.table(
text =
'col1 Col2
Pelagodinium_beii_RCC1491_SRR1300503_MMETSP1338c20 4
Acanthoeca_10tr_SRR1294413_MMETSP0105_2c10003_g1_i1 5
Rhodosorus_marinus_UTEX-LB-2760_SRR1296985_MMETSP 5
Vannella_sp._CB-2014_DIVA3-518-3-11-1-6_SRR1296762_M 3
Florenciella_parvula_CCMP2471_SRR1294437_MMETSP134 5
',
header = TRUE
)
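To make the pattern in Solution 1 easier to follow, here is a minimal sketch that annotates each part of the regular expression and runs it on a small vector. The vector x and the name keep_two are hypothetical and only for illustration; the last two entries are made up to show that strings with fewer than two underscores are returned unchanged.

```r
# Hypothetical sample vector (not part of the original data frame); the last
# two entries contain fewer than two underscores.
x <- c(
  "Pelagodinium_beii_RCC1491_SRR1300503_MMETSP1338c20",
  "Vannella_sp._CB-2014_DIVA3-518-3-11-1-6_SRR1296762_M",
  "Acanthoeca_10tr",
  "Unknown"
)

# ^               anchor at the start of the string
# ([^_]+_[^_]+)   capture group 1: first part, "_", second part
# _.*             the second underscore and everything after it
# "\\1"           replace the whole match with capture group 1
keep_two <- sub("^([^_]+_[^_]+)_.*", "\\1", x)
print(keep_two)
# [1] "Pelagodinium_beii" "Vannella_sp."      "Acanthoeca_10tr"   "Unknown"
```

Because sub() leaves non-matching strings untouched, names that are already short pass through as they are.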
Solution 2:[2]
An option with strsplit:
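# as.character() guards against col1 being a factor; USE.NAMES = FALSE returns an unnamed vector
df$col1 <- sapply(as.character(df$col1), function(i) paste0(strsplit(i, "_")[[1]][1:2], collapse = "_"), USE.NAMES = FALSE)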
# col1 Col2
# 1 Pelagodinium_beii 4
# 2 Acanthoeca_10tr 5
# 3 Rhodosorus_marinus 5
# 4 Vannella_sp. 3
# 5 Florenciella_parvula 5
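The strsplit approach can be broken into its two steps on a single string. This is only an illustrative sketch: the names x, parts and species are assumptions for the example, and head() is used here instead of [1:2] so that a string with a single piece would not be padded with NA.

```r
x <- "Rhodosorus_marinus_UTEX-LB-2760_SRR1296985_MMETSP"

# strsplit() returns a list with one character vector per input string;
# [[1]] extracts the pieces for our single string.
parts <- strsplit(x, "_", fixed = TRUE)[[1]]
print(parts)
# [1] "Rhodosorus"   "marinus"      "UTEX-LB-2760" "SRR1296985"   "MMETSP"

# Keep the first two pieces and join them again with "_".
species <- paste(head(parts, 2), collapse = "_")
print(species)
# [1] "Rhodosorus_marinus"
```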
Another way is to use word() from the stringr package:
library(stringr)
df$col1 <- word(df$col1, 1, 2, sep = "_")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Aurèle |
| Solution 2 | |
