How to modify headers of a fasta file and merge it with the information of another file?

I have a huge fasta file that has this form:

>HQ323811.1 Abies alba tRNA-Leu (trnL) gene, intron; chloroplast
GGGCAATCCTGAGCCAAATCCGGTTCATAGAGAAAAGGGTTTCTCTCCTTCTCCTAAGGA AAGGGATAGGTGCAGAGACTCAATGG

I also have another file that contains the taxonomic information for each organism in my fasta file.

I would like to obtain a final fasta file whose headers contain only the scientific name of the species followed by its taxonomic information. Is there a way to do this? I have no idea where to start. Could someone please point me to a tutorial or something I can read to try to do it?

Thank you,



Solution 1:[1]

Let there be a file in.fasta containing

>HQ323811.1 Abies alba tRNA-Leu (trnL) gene, intron; chloroplast
GGGCAATCCTGAGCCAAATCCGGTTCATAGAGAAAAGGGTTTCTCTCCTTCTCCTAAGGA AAGGGATAGGTGCAGAGACTCAATGG

And a file tax.txt containing one species per line, with the species name first and its lineage after it, e.g.

Abies alba cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Acrogymnospermae; Pinopsida; Pinidae; Conifers I; Pinales; Pinaceae; Abies

Then you can run this script in the R programming language:

library(tidyverse)
library(tidysq)

lineages <-
  # A taxon might have multiple intermediate ranks, e.g. Conifers I,
  # so we cannot assume just species, genus, family, order, class, phylum, kingdom.
  # The lineage therefore has to be parsed.
  read_lines("tax.txt") %>%
  tibble(lineage = .) %>%
  mutate(
    # first two words of the line
    # ([A-Za-z] rather than [A-z], which also matches [ \ ] ^ _ and `)
    species = lineage %>% str_extract("^[A-Za-z0-9]+ [A-Za-z0-9]+"),
    # last word of the line (the lineage is assumed to end with the genus)
    genus = lineage %>% str_extract("[A-Za-z0-9]+$")
  )
lineages

read_fasta("in.fasta") %>%
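  # drop the accession, then keep the next two words as the species name
  mutate(species = name %>% str_remove("^[A-Za-z0-9.]+") %>% str_extract("[A-Za-z0-9]+ [A-Za-z0-9]+")) %>%
  # join explicitly on the species column
  left_join(lineages, by = "species") %>%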
  transmute(
    sq,
    name = str_glue("{row_number()}; g:{genus}, s:{species}")
  ) %>%
  {
    .x <- .
    write_fasta(.x$sq, .x$name, "out.fasta")
  }

resulting in a file out.fasta containing

>1; g:Abies, s:Abies alba
GGGCAATCCTGAGCCAAATCCGGTTCATAGAGAAAAGGGTTTCTCTCCTTCTCCTAAGGA AAGGGATAGGTGCAGAGAC
TCAATGG
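
The question asks for headers made of the species name followed by the taxonomic information, which differs slightly from the g:/s: format above. A minimal base-R sketch of that variant could look like the following. The function name relabel_fasta and the "unclassified" placeholder are made up for illustration, and the code assumes that every header is "<accession> <Genus> <species> ..." and that every line of tax.txt starts with a two-word species name:

```r
# Sketch only: rewrite FASTA headers as "<species> <lineage>".
# relabel_fasta is a hypothetical helper, not part of any package.
relabel_fasta <- function(fasta_lines, tax_lines) {
  # Lookup table: species name (first two words) -> rest of the lineage line
  tax_species <- sub("^([^ ]+ [^ ]+) .*$", "\\1", tax_lines)
  tax_lineage <- sub("^[^ ]+ [^ ]+ ", "", tax_lines)
  lookup <- setNames(tax_lineage, tax_species)

  is_header <- startsWith(fasta_lines, ">")
  # Words 2 and 3 of each header are taken to be the species name
  species <- sub("^>[^ ]+ ([^ ]+ [^ ]+).*$", "\\1", fasta_lines[is_header])
  lineage <- unname(lookup[species])
  # Species that do not occur in tax.txt get an explicit placeholder
  lineage[is.na(lineage)] <- "unclassified"

  fasta_lines[is_header] <- paste0(">", species, " ", lineage)
  fasta_lines
}

fasta <- c(
  ">HQ323811.1 Abies alba tRNA-Leu (trnL) gene, intron; chloroplast",
  "GGGCAATCCTGAGCC"
)
tax <- "Abies alba cellular organisms; Eukaryota; Pinaceae; Abies"
out <- relabel_fasta(fasta, tax)
print(out)
```

With real files, the inputs would come from readLines("in.fasta") and readLines("tax.txt"), and the result could be written with writeLines(out, "out.fasta"). Sequence lines are passed through untouched, so the original line wrapping is kept.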

You can also create a file headers.txt containing the final headers. The first line is the new name of the first sequence of in.fasta and so on:

1tax=f:Pinaceae;g:Abies;s:alba

The corresponding R script will then be:

library(tidyverse)
library(tidysq)

headers <- read_lines("headers.txt")

read_fasta("in.fasta") %>%
  mutate(name = headers) %>%
  {
    .x <- .
    write_fasta(.x$sq, .x$name, "out.fasta")
  }
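
Because this second approach pairs headers with sequences purely by position, it only works if headers.txt has exactly one line per sequence, in the same order. A small base-R check along these lines could be run first. The function name check_header_count is hypothetical, and the check only compares counts, so it cannot detect a wrong order:

```r
# Sketch only: stop early when headers.txt and in.fasta disagree in length.
# check_header_count is a hypothetical helper, not part of tidysq.
check_header_count <- function(fasta_lines, headers) {
  # Every FASTA record starts with a line beginning with ">"
  n_seq <- sum(startsWith(fasta_lines, ">"))
  if (n_seq != length(headers)) {
    stop(sprintf(
      "in.fasta has %d sequences but headers.txt has %d lines",
      n_seq, length(headers)
    ))
  }
  invisible(TRUE)
}

# Example with in-memory data; with real files use readLines() on both
check_header_count(
  c(">a", "ACGT", ">b", "GGCC"),
  c("1tax=f:Pinaceae;g:Abies;s:alba", "2tax=f:Pinaceae;g:Picea;s:abies")
)
```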

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1