'How do I find all Sequence Lengths in a FASTA Dataset without using the Biopython

Let's say we have a FASTA file like this:

>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
>header2
ASFECAERVA
>header3
ACTGQSDFWGRWTFSH

and this is my desired output:

header1 30
header2 10
header3 16

Please answer this question without using Biopython. I think re.match('^>') can be used here to distinguish the header line and other sequence lines (need to import re first) but I need some helps for the rest part to get the output. Only use python, no Biopython. Thank you!

Here is my current code:

import re
path='/home/try.txt'
count=0
seq=''
with open(path) as d_g:
    for line in d_g:
        if re.match('^>',line):
            count=count+1
            print('head',count)
        else:
            seq=seq+line

Solution 1:^[1]

You really don't need regular expressions for this.

header = None
length = 0
with open('file.fasta') as fasta:
    for line in fasta:
        # Trim newline
        line = line.rstrip()
        if line.startswith('>'):
            # If we captured one before, print it now
            if header is not None:
                print(header, length)
                length = 0
            header = line[1:]
        else:
            length += len(line)
# Don't forget the last one
if length:
    print(header, length)

This uses very basic Python idioms which should be easy to pick up from any introduction to text processing in Python. Very briefly, we remember what we have seen so far, and print what we remember before starting to remember a new record (which happens when we see a header line). A common bug is forgetting to print the last record when we reach the end of the file, which of course doesn't have a new header line.

If you can guarantee that all sequences are on a single line, this could be simplified radically, but I wanted to present a general solution.

Solution 2:^[2]

collections.Counter() and a simple loop with state to the rescue.

I augmented your test data to have a multi-line sequence too, just so we can tell things work right.

import io
import collections

# Using a stringio to emulate a file
data = io.StringIO("""
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""".strip())


counter = collections.Counter()

header = None
for line in data:
    line = line.strip()  # deal with newlines
    if line.startswith(">"):
        header = line[1:]
        continue
    counter[header] += len(line)

print(counter)

This prints out

Counter({'header1': 60, 'header4': 16, 'header2': 10})

Solution 3:^[3]

Do not reinvent the wheel. There are many specialized bioinformatics tools to handle fasta files.

For example, use infoseq utility from the EMBOSS package:

infoseq -auto -only -name -length -noheading file.fasta

Prints:

header1        30     
header2        10     
header3        16

Install EMBOSS, for example, using conda:

conda create --name emboss emboss

From the man page:

infoseq -help
   -only               boolean    [N] This is a way of shortening the command
                                  line if you only want a few things to be
                                  displayed. Instead of specifying:
                                  '-nohead -noname -noacc -notype -nopgc
                                  -nodesc'
                                  to get only the length output, you can
                                  specify
                                  '-only -length'
   -[no]heading        boolean    [Y] Display column headings
   -name               boolean    [@(!$(only))] Display 'name' column
   -length             boolean    [@(!$(only))] Display 'length' column

Solution 4:^[4]

I am following this code, can someone help me to output the header + length into a .txt file please?

this is my current code:

header = None
length = 0
fastaseqs = {} 
with open(r'path\file.faa') as fasta:
    for line in fasta:
        # Trim newline
        line = line.rstrip()
        if line.startswith('>'):
            # If we captured one before, print it now
            if header is not None:
                print(header, length)
                length = 0
            header = line[1:]
            if header not in fastaseqs:
                fastaseqs[header] = []
        else:
            length += len(line)
            
            if length not in fastaseqs[header]:
                fastaseqs[header].append(length)

outfile = open(r'path\db_length.txt')
for header in fastaseqs:
    outfile.write('Protein_sequence_name\tlength\n')
    outfile.write(header + '\t' + length + '\n')
outfile.close()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	AKX
Solution 3	Timur Shtatland
Solution 4	Daniela Rothschild

'How do I find all Sequence Lengths in a FASTA Dataset without using the Biopython

Solution 1:[1]

Solution 2:[2]

Solution 3:[3]

Solution 4:[4]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]

Solution 3:^[3]

Solution 4:^[4]