Remove duplicate sequences of unequal length in a FASTA file, retaining the longest sequence
I am working with a FASTA file that contains sequences of different lengths for multiple organisms. I want to filter the sequences so that only the longest ones are retained, while also retaining sequences with different genus and/or species names. Differences in other attributes, such as GenBank accession number or country, should not prevent a sequence from being filtered out.
My fasta file looks like this:
>Species_one_KY256987_Norway
ATTGCGTTTAATTTGCGCCA
>Species_one_KY256988_USA
ATTGCGTTTAATTTGCGCCATTTCGCTT
>Species_one_KY256989_USA
ATTGCGTTTAATTTGCGCCATTTCGCTC
>Genus_two_KY256990_Italy
ATTGCGTTTAATTTGCGCCA
>Species_three_KY256991_Australia
TTGGACTAAATGGATTACCCTTAATATA
>Species_three_KY256992_Australia
TTGGACTAAATGGATTACCCTT
I want to remove the sequences "Species_one_KY256987_Norway" and "Species_three_KY256992_Australia", which are identical to but shorter than "Species_one_KY256988_USA" and "Species_three_KY256991_Australia", respectively. At the same time, I want to keep "Genus_two_KY256990_Italy": its sequence is identical to but shorter than "Species_one_KY256988_USA" and "Species_one_KY256989_USA", yet it has a different genus and/or species name.
The expected output would be:
>Species_one_KY256988_USA
ATTGCGTTTAATTTGCGCCATTTCGCTT
>Species_one_KY256989_USA
ATTGCGTTTAATTTGCGCCATTTCGCTC
>Genus_two_KY256990_Italy
ATTGCGTTTAATTTGCGCCA
>Species_three_KY256991_Australia
TTGGACTAAATGGATTACCCTTAATATA
Many thanks!
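One possible way to express the filtering described above, as a minimal Python sketch rather than a definitive solution. It rests on two assumptions that are not stated in the question: the genus/species name is taken to be the first two underscore-separated fields of the header, and "identical but shorter" is interpreted as "the shorter sequence occurs in full inside a longer sequence with the same name". The function names (parse_fasta, taxon_key, filter_records) are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict


def parse_fasta(text):
    """Return a list of (header, sequence) pairs; wrapped sequence lines are joined."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def taxon_key(header):
    # Assumption: the first two underscore-separated fields form the
    # genus/species name, e.g. "Species_one" in "Species_one_KY256987_Norway".
    return "_".join(header.split("_")[:2])


def filter_records(records):
    """Drop every sequence contained in a longer (or equal) one with the same name."""
    groups = defaultdict(list)
    for idx, (header, seq) in enumerate(records):
        groups[taxon_key(header)].append((idx, header, seq))

    kept = []
    for members in groups.values():
        survivors = []
        # Visit the longest sequences first; the sort is stable, so ties keep input order.
        for idx, header, seq in sorted(members, key=lambda m: -len(m[2])):
            # Assumption: substring containment counts as "identical but shorter".
            if any(seq in other for _, _, other in survivors):
                continue
            survivors.append((idx, header, seq))
        kept.extend(survivors)

    kept.sort()  # restore the original record order via the stored index
    return [(header, seq) for _, header, seq in kept]
```

To use it, read the file contents into a string, pass them through parse_fasta and then filter_records, and write each surviving pair back out as a ">" header line followed by its sequence. Sequences are only compared within the same name group, which is why a record such as "Genus_two_KY256990_Italy" survives even though its sequence also occurs inside a longer one from another group. If only shared starts should count, `other.startswith(seq)` could replace the `in` test.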
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
