How to modify headers of a fasta file and merge it with the information of another file?
I have a huge fasta file that has this form:
>HQ323811.1 Abies alba tRNA-Leu (trnL) gene, intron; chloroplast
GGGCAATCCTGAGCCAAATCCGGTTCATAGAGAAAAGGGTTTCTCTCCTTCTCCTAAGGA AAGGGATAGGTGCAGAGACTCAATGG
Then, I have another file that contains the taxonomic information for each organism in my fasta file.
I would like to obtain a final fasta file whose headers contain only the scientific name of the species followed by the taxonomic information. Is there a way to do this? I have no idea where to start. Could someone please tell me if there is a tutorial or something I can read to try to do it?
Thank you,
Solution 1:[1]
Let there be a file in.fasta containing
>HQ323811.1 Abies alba tRNA-Leu (trnL) gene, intron; chloroplast
GGGCAATCCTGAGCCAAATCCGGTTCATAGAGAAAAGGGTTTCTCTCCTTCTCCTAAGGA AAGGGATAGGTGCAGAGACTCAATGG
And a file tax.txt containing one species per line (the species name followed by its lineage), e.g.
Abies alba cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Acrogymnospermae; Pinopsida; Pinidae; Conifers I; Pinales; Pinaceae; Abies
Then you can run this script in the R programming language (it uses the tidyverse and tidysq packages):
library(tidyverse)
library(tidysq)

# Parse tax.txt into one row per species.
# A taxon might have multiple intermediate ranks, e.g. "Conifers I", so we
# cannot assume just species, genus, family, order, class, phylum, kingdom;
# the lineage has to be parsed.
lineages <-
  read_lines("tax.txt") %>%
  tibble(lineage = .) %>%
  mutate(
    # first two words
    species = lineage %>% str_extract("^[A-Za-z0-9]+ [A-Za-z0-9]+"),
    # last word
    genus = lineage %>% str_extract("[A-Za-z0-9]+$")
  )

# print the parsed table to check it
lineages

read_fasta("in.fasta") %>%
  # drop the accession (first word), then keep the next two words as the species
  mutate(species = name %>% str_remove("^[A-Za-z0-9.]+") %>% str_extract("[A-Za-z0-9]+ [A-Za-z0-9]+")) %>%
  # join explicitly on the species column
  left_join(lineages, by = "species") %>%
  transmute(
    sq,
    name = str_glue("{row_number()}; g:{genus}, s:{species}")
  ) %>%
  {
    .x <- .
    write_fasta(.x$sq, .x$name, "out.fasta")
  }
This results in a file out.fasta containing:
>1; g:Abies, s:Abies alba
GGGCAATCCTGAGCCAAATCCGGTTCATAGAGAAAAGGGTTTCTCTCCTTCTCCTAAGGA AAGGGATAGGTGCAGAGAC
TCAATGG
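To make the matching logic easier to follow without installing any packages, here is a minimal base-R sketch of the same idea applied to the single example record. It assumes, as the script above does, that the species is the two words following the accession in the FASTA header, that each tax.txt line starts with the species name, and that the genus is the last semicolon-separated field. The helper name first_two_words is made up for this illustration.

```r
# Example inputs copied from in.fasta and tax.txt above
header <- "HQ323811.1 Abies alba tRNA-Leu (trnL) gene, intron; chloroplast"
tax_line <- paste(
  "Abies alba cellular organisms; Eukaryota; Viridiplantae; Streptophyta;",
  "Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta;",
  "Acrogymnospermae; Pinopsida; Pinidae; Conifers I; Pinales; Pinaceae; Abies"
)

# Hypothetical helper: first two space-separated words of a string
first_two_words <- function(x) {
  paste(strsplit(x, " +")[[1]][1:2], collapse = " ")
}

# tax.txt side: species = first two words, genus = last ";"-separated field
tax_species <- first_two_words(tax_line)
ranks <- trimws(strsplit(tax_line, ";", fixed = TRUE)[[1]])
tax_genus <- ranks[length(ranks)]

# FASTA side: drop the accession (first word), then take the next two words
fasta_species <- first_two_words(sub("^[^ ]+ +", "", header))

# "Join" on species and build the new header for sequence number 1
stopifnot(identical(fasta_species, tax_species))
new_header <- sprintf(">%d; g:%s, s:%s", 1L, tax_genus, fasta_species)
print(new_header)
```

For this record the printed header should match the first line of out.fasta shown above. Headers that do not follow the "accession, genus, species" pattern (for example hybrids or entries with "sp.") would need extra handling.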
You can also create a file headers.txt containing the final headers. The first line is the new name of the first sequence of in.fasta and so on:
1tax=f:Pinaceae;g:Abies;s:alba
The corresponding R script will then be:
library(tidyverse)
library(tidysq)

# one new header per line, in the same order as the sequences in in.fasta
headers <- read_lines("headers.txt")

read_fasta("in.fasta") %>%
  mutate(name = headers) %>%
  {
    .x <- .
    write_fasta(.x$sq, .x$name, "out.fasta")
  }
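Because this second approach assigns the new names purely by position, it may be worth checking first that headers.txt has exactly one line per sequence. The following base-R sketch shows one possible check. It writes small stand-in files to a temporary directory so that it runs on its own; the file contents are copied from the examples above, and wrapping the sequence over two lines is an assumption about the original layout.

```r
# Stand-in files in a temporary directory (contents copied from the examples above)
tmp <- tempfile("fasta_demo_")
dir.create(tmp)
fasta_path <- file.path(tmp, "in.fasta")
headers_path <- file.path(tmp, "headers.txt")
writeLines(
  c(">HQ323811.1 Abies alba tRNA-Leu (trnL) gene, intron; chloroplast",
    # sequence wrapped over two lines (assumed layout)
    "GGGCAATCCTGAGCCAAATCCGGTTCATAGAGAAAAGGGTTTCTCTCCTTCTCCTAAGGA",
    "AAGGGATAGGTGCAGAGACTCAATGG"),
  fasta_path
)
writeLines("1tax=f:Pinaceae;g:Abies;s:alba", headers_path)

# Count FASTA records (lines starting with ">") and new headers
n_records <- sum(startsWith(readLines(fasta_path), ">"))
headers <- readLines(headers_path)

# Stop early if the two files do not line up one-to-one
if (length(headers) != n_records) {
  stop("headers.txt has ", length(headers), " lines but in.fasta has ",
       n_records, " sequences")
}
print(n_records)
```

With real data, replace the two temporary paths with the paths to your own in.fasta and headers.txt and drop the writeLines() calls.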
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
