'I have a fasta file with millions of sequences. I want to only extract those that's names match within a .txt file, how can I go about doing this?

I have been sorting through a ~1.5m read fasta file ('V1_6D_contigs_5kbp.fa') to determine which of the reads are likely to be 'viral' in origin. The reads in this file are denoted as Vx_Cz - where x is 1-6, depending on which trial group it came from, and z is the contig number/name from 1-~1.5m. e.g V1_C10810 or V3_C587937...

Through varying bioinformatic pipelines I have produced a .txt file with a list (2699 long) of the contig names that are predicted (<0.05) to be viral. I now need to use this list of predicted contigs to extract and produce a new fasta file that contains only these contigs.

The theoretical idea behind my code is that it opens the .txt file (names of each significant contig) and the original fasta file, goes through each line of the .txt file and sets the line (contig name) as a variable. It should then loop through the original fasta file which contains all the sequence information and if the contig name matches the record.id (contig name from original file) it should then export the full record information to a new file.

I think I am close, but my current iterations seems to run only one or the the other loop as I expect them to.

Please see the code below. I have added notes below to what runs wrong with each program I have tried.

I am using Python, including SeqIO the Biopython application.

#!/usr/bin/env python

from Bio import SeqIO

Lines = []
Counter = 0 

With open(‘VIBRANT_contigs.txt’) as f:
 lines = f.readlines()
 original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
 corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
  with open(original_file) as original, open(corrected_file,'w') as corrected:
   records = SeqIO.parse(original_file,'fasta')
    for record in records: 
     id = record.id
     print(“this is the id”, id)
     for line in lines:
      contig = line
      print(“This is the contig”, contig)
       if contig == id:
        SeqIO.write(record, corrected, ‘fasta’)  
         Counter = (counter + 1)

This only prints a list of “this is the id”, id

#!/usr/bin/env python

Lines = []
With open(‘VIBRANT_contigs.txt’) as f:
 lines = f.readlines()
     for line in lines:
      contig = line
      print(contig)

This works and prints the list of contigs within Vibrant_contigs.txt

>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
...  lines = f.readlines()
...  original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
...  corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
...  with open(original_file) as original, open(corrected_file,'w') as corrected:
...   records = SeqIO.parse(original_file,'fasta')
...   for line in lines:
...    contig = line
...    for record in records:
...     id = record.id
...     if contig == id:
...      print("we got a hit")

Prints nothing


>>> from Bio import SeqIO
>>> lines = []
>>> with open('VIBRANT_contigs.txt') as f:
...  lines = f.readlines()
...  original_file = r"/home/ggb_joshcole/Documents/VxCx_deduped/V1_6D_contigs_5kbp.fa"
...  corrected_file = r"/home/ggb_joshcole/Documents/output.fna"
...  with open(original_file) as original, open(corrected_file,'w') as corrected:
...   records = SeqIO.parse(original_file,'fasta')
...   for line in lines:
...    contig = line
...    print("this is the contig", contig)
...    for record in records:
...     id = record.id
...     print("this is the id", id)
...     if contig == id:
...      print("we got a hit")

Prints ‘this is the contig V1_C1005406’ (the first contig in the list)

Then proceeds to print all the ids

Then just print all the contigs

Any help would be greatly appreciated as I am very stuck.

Thank you for your time.

python biopython

Solution 1:^[1]

Among quite a few typos, the main issue is that the line from lines=f.readlines() will still contain the newline character \n and will therefore never match the id from SeqIO, the solution is to use a simple strip() call:

from Bio import SeqIO

Lines = []
Counter = 0

with open("ids_to_extract.txt") as f:
    lines = f.readlines()
    original_file = r"input_sequences.fasta"
    corrected_file = r"extracted_sequences.fasta"
    with open(original_file) as original, open(corrected_file,'w') as corrected:
        records = SeqIO.parse(original_file,'fasta')
        for record in records:
            id = record.id
            print("this is the id", id)
            for line in lines:
                contig = line.strip() # <--- fixed here
                print("This is the contig", contig)
                if contig == id:
                    SeqIO.write(record, corrected, 'fasta')
                    Counter = (Counter + 1)

The input data is like this:

input_sequences.fasta:

>SRR10971381.1 1 length=151
CAAAGTCAAGCTCCCTTCTGCCTTTACACTCTTCGAGCGATTTCCGTCCGCTCTGAGGGAACCTTTGGGCGCCTCCGTTACTCTTTTGGAGGCGACCGCCCCAGTCAAACTGCCCGCCTAAGACTGTCCGGCCGGTCNTTACTCGGCNCGT
>SRR10971381.2 2 length=151
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>SRR10971381.3 3 length=47
TACGATGTCGTTCCGGAAGGTAAGGGCGTGGATGAGACTAAGATCGG
>SRR10971381.4 4 length=149
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
[....]

ids_to_extract: (SeqIo splits at whitespaces, so the id should only be up to the first whitespaces)

SRR10971381.1
SRR10971381.2
SRR10971381.3
SRR10971381.4

That said, this code is still very inefficient. For every contig, you are iterating over all lines of the file. Much easier would be to generate a set of all ids that you need and then just check if it the id is in the set, increasing the average runtime for the check from O(n) to O(1) (reference):

from Bio import SeqIO

Lines = []
Counter = 0

#Pre-Read all ids to extract into a set
with open("ids_to_extract.txt") as f:
    ids_to_extract = {line.strip() for line in f.readlines()}

original_file = r"input_sequences.fasta"
corrected_file = r"extracted_sequences.fasta"
with open(original_file) as original, open(corrected_file,'w') as corrected:
    records = SeqIO.parse(original_file,'fasta')
    for record in records:
        recordId = record.id
        # Now a simplle `in` check
        if recordId in ids_to_extract:
            SeqIO.write(record, corrected, 'fasta')
            Counter = (Counter + 1)

Note also that there are several other fast an easy options to perform this operation:

# using Heng Li's seqkit
seqkit grep -f ids_to_extract.txt input_sequences.fasta
# using bioawk
bioawk -c fastx -v file="ids_to_extract.txt" 'BEGIN{while((getline k < file)>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}'  input_sequences.fasta > extracted_sequences.fasta

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	FlyingTeller

'I have a fasta file with millions of sequences. I want to only extract those that's names match within a .txt file, how can I go about doing this?

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]