Remove duplicate sequences of unequal length in a FASTA file, retaining the longest sequence

I am working with a FASTA file that contains sequences of different lengths for multiple organisms. I would like to filter it so that, among duplicate sequences, only the longest one is retained, while sequences with different genus and/or species names are always kept. Other header attributes, such as the GenBank accession number or the country, should not prevent two records from being treated as duplicates.

My FASTA file looks like this:

>Species_one_KY256987_Norway
ATTGCGTTTAATTTGCGCCA
>Species_one_KY256988_USA
ATTGCGTTTAATTTGCGCCATTTCGCTT
>Species_one_KY256989_USA
ATTGCGTTTAATTTGCGCCATTTCGCTC
>Genus_two_KY256990_Italy
ATTGCGTTTAATTTGCGCCA
>Species_three_KY256991_Australia
TTGGACTAAATGGATTACCCTTAATATA
>Species_three_KY256992_Australia
TTGGACTAAATGGATTACCCTT

I would like to remove the sequences "Species_one_KY256987_Norway" and "Species_three_KY256992_Australia", which are identical to, but shorter than, "Species_one_KY256988_USA" and "Species_three_KY256991_Australia", respectively. I also want to keep "Genus_two_KY256990_Italy": although its sequence is identical to, and shorter than, "Species_one_KY256988_USA" and "Species_one_KY256989_USA", it has a different genus and/or species name.

The expected output would be:

>Species_one_KY256988_USA
ATTGCGTTTAATTTGCGCCATTTCGCTT
>Species_one_KY256989_USA
ATTGCGTTTAATTTGCGCCATTTCGCTC
>Genus_two_KY256990_Italy
ATTGCGTTTAATTTGCGCCA
>Species_three_KY256991_Australia
TTGGACTAAATGGATTACCCTTAATATA

Many thanks!
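
One possible approach, as a minimal Python sketch rather than a definitive solution: group the records by name, then within each group drop any sequence that is contained in a longer sequence that has already been kept. It rests on several assumptions not stated in the question: the first two underscore-separated header fields are the genus and species, "identical but shorter" means the shorter sequence occurs somewhere inside the longer one (swap `in` for `str.startswith` to require a shared start), and comparison is case-sensitive. The names `read_fasta`, `name_key`, `filter_records` and `EXAMPLE` are made up for illustration.

```python
from collections import defaultdict

# Example data copied from the question
EXAMPLE = """\
>Species_one_KY256987_Norway
ATTGCGTTTAATTTGCGCCA
>Species_one_KY256988_USA
ATTGCGTTTAATTTGCGCCATTTCGCTT
>Species_one_KY256989_USA
ATTGCGTTTAATTTGCGCCATTTCGCTC
>Genus_two_KY256990_Italy
ATTGCGTTTAATTTGCGCCA
>Species_three_KY256991_Australia
TTGGACTAAATGGATTACCCTTAATATA
>Species_three_KY256992_Australia
TTGGACTAAATGGATTACCCTT
"""


def read_fasta(lines):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def name_key(header):
    """Assumption: the first two '_'-separated fields are genus and species."""
    return "_".join(header.split("_")[:2])


def filter_records(records):
    """Keep a record unless its sequence is contained in an already kept,
    equally long or longer sequence carrying the same name."""
    records = list(records)
    groups = defaultdict(list)
    for index, (header, seq) in enumerate(records):
        groups[name_key(header)].append((index, seq))

    keep = set()
    for members in groups.values():
        kept_seqs = []
        # Longest first; sorted() is stable, so ties keep their input order
        for index, seq in sorted(members, key=lambda m: -len(m[1])):
            if not any(seq in longer for longer in kept_seqs):
                kept_seqs.append(seq)
                keep.add(index)
    # Emit the survivors in their original file order
    return [rec for i, rec in enumerate(records) if i in keep]


for header, seq in filter_records(read_fasta(EXAMPLE.splitlines())):
    print(f">{header}\n{seq}")
```

To run it on a real file, an open file handle can be passed to `read_fasta` in place of `EXAMPLE.splitlines()`. Each group is compared pairwise, so very large groups of same-named sequences may be slow.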

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
