'How do I find all Sequence Lengths in a FASTA Dataset without using the Biopython
Let's say we have a FASTA file like this:
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
>header2
ASFECAERVA
>header3
ACTGQSDFWGRWTFSH
and this is my desired output:
header1 30
header2 10
header3 16
Please answer this question without using Biopython. I think re.match('^>') can be used here to distinguish the header line and other sequence lines (need to import re first) but I need some helps for the rest part to get the output. Only use python, no Biopython. Thank you!
Here is my current code:
import re
path='/home/try.txt'
count=0
seq=''
with open(path) as d_g:
for line in d_g:
if re.match('^>',line):
count=count+1
print('head',count)
else:
seq=seq+line
Solution 1:[1]
You really don't need regular expressions for this.
header = None
length = 0
with open('file.fasta') as fasta:
for line in fasta:
# Trim newline
line = line.rstrip()
if line.startswith('>'):
# If we captured one before, print it now
if header is not None:
print(header, length)
length = 0
header = line[1:]
else:
length += len(line)
# Don't forget the last one
if length:
print(header, length)
This uses very basic Python idioms which should be easy to pick up from any introduction to text processing in Python. Very briefly, we remember what we have seen so far, and print what we remember before starting to remember a new record (which happens when we see a header line). A common bug is forgetting to print the last record when we reach the end of the file, which of course doesn't have a new header line.
If you can guarantee that all sequences are on a single line, this could be simplified radically, but I wanted to present a general solution.
Solution 2:[2]
collections.Counter() and a simple loop with state to the rescue.
I augmented your test data to have a multi-line sequence too, just so we can tell things work right.
import io
import collections
# Using a stringio to emulate a file
data = io.StringIO("""
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""".strip())
counter = collections.Counter()
header = None
for line in data:
line = line.strip() # deal with newlines
if line.startswith(">"):
header = line[1:]
continue
counter[header] += len(line)
print(counter)
This prints out
Counter({'header1': 60, 'header4': 16, 'header2': 10})
Solution 3:[3]
Do not reinvent the wheel. There are many specialized bioinformatics tools to handle fasta files.
For example, use infoseq utility from the EMBOSS package:
infoseq -auto -only -name -length -noheading file.fasta
Prints:
header1 30
header2 10
header3 16
Install EMBOSS, for example, using conda:
conda create --name emboss emboss
From the man page:
infoseq -help
-only boolean [N] This is a way of shortening the command
line if you only want a few things to be
displayed. Instead of specifying:
'-nohead -noname -noacc -notype -nopgc
-nodesc'
to get only the length output, you can
specify
'-only -length'
-[no]heading boolean [Y] Display column headings
-name boolean [@(!$(only))] Display 'name' column
-length boolean [@(!$(only))] Display 'length' column
Solution 4:[4]
I am following this code, can someone help me to output the header + length into a .txt file please?
this is my current code:
header = None
length = 0
fastaseqs = {}
with open(r'path\file.faa') as fasta:
for line in fasta:
# Trim newline
line = line.rstrip()
if line.startswith('>'):
# If we captured one before, print it now
if header is not None:
print(header, length)
length = 0
header = line[1:]
if header not in fastaseqs:
fastaseqs[header] = []
else:
length += len(line)
if length not in fastaseqs[header]:
fastaseqs[header].append(length)
outfile = open(r'path\db_length.txt')
for header in fastaseqs:
outfile.write('Protein_sequence_name\tlength\n')
outfile.write(header + '\t' + length + '\n')
outfile.close()
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | AKX |
| Solution 3 | Timur Shtatland |
| Solution 4 | Daniela Rothschild |
