How to optimize code for a file with 72 million entries?

I have asked this question on another platform as well (here). I would appreciate your input on making my Python code run faster: it currently takes more than 3 hours for a file with millions of entries.

from Bio import SeqIO
import sys


def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts1 = {}
    dicts2 = {}
    lst = []
    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            #print(record.id)
            #print(record.seq)
            #looking for the 3 prime adapter
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
            #Only the record is sliced so that all attributes like the Phred scores are kept in the FASTQ output
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                i = record.id
                x = miRNAseq.seq+umi_seq.seq
                #print((miRNAseq+umi_seq).format("fastq"))
                dicts1[i]=x

        #write the ids and seqs into a dictionary and keep only one entry when multiple reads share the same miRNA + UMI sequence
        for k,v in dicts1.items():
            if v not in dicts2.values():
                dicts2[k] = v

        #making a list
        for keys in dicts2:
            lst.append(keys)

    #print(dicts1)
    #print(dicts2)
    #print(lst)


    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            #based on the ids saved in the list, print the entries (miRNA + 3' adapter + UMI)
            if record.id in lst:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                #print(record.seq)
                #print(miRNAseq.seq)
                #print(adapter_seq.seq)
                #print(umi_seq.seq)
                #print("@"+record.id)
                if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) <= 50:
                    print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')

                if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) > 50:
                    cut = len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) - 50
                    print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')

if __name__ == '__main__':
    QIAseq_UMI_correction()


Solution 1:[1]

Why not read, parse, and loop over the file only once? I have moved the code of the second loop into the first; am I missing something? Why loop twice?

from Bio import SeqIO
import sys


def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts1 = {}
    dicts2 = {}
    lst = []
    sentinel = 100

    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):

            # only for testing
            if sentinel < 0:
                break
            sentinel -= 1

            #print(record.id)
            #print(record.seq)
            #looking for the 3 prime adapter
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                #Only the record is sliced so that all attributes like the Phred scores are kept in the FASTQ output
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                i = record.id
                x = miRNAseq.seq+umi_seq.seq
                #print((miRNAseq+umi_seq).format("fastq"))

                if x not in dicts2:
                    if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) <= 50:
                        print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')

                    if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) > 50:
                        cut = len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) - 50
                        print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')

                dicts1[i]=x
                dicts2[x]=i


if __name__ == '__main__':
    QIAseq_UMI_correction()

Other suggestions:

As mentioned in the comments, you could time the major steps to see where time can be saved; check out timeit, for example. My suggestion is to time SeqIO.parse and the if "AACTGTAGGCACCATCAAT" in record.seq: check separately.

I suspect that a major chunk of time is spent parsing with SeqIO.parse, so ideally you should do that only once.
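
A minimal timing sketch of that idea, using time.perf_counter so a whole pass over a file can be measured; the file name and counters here are placeholders and not part of the original code:

import time
from Bio import SeqIO

ADAPTER = "AACTGTAGGCACCATCAAT"

start = time.perf_counter()
parsed = 0
matched = 0
with open("reads.fastq") as handle:   # placeholder file name
    for record in SeqIO.parse(handle, "fastq"):
        parsed += 1
        if ADAPTER in record.seq:     # the adapter check being timed
            matched += 1
elapsed = time.perf_counter() - start
print(f"parsed {parsed} records, {matched} with adapter, in {elapsed:.1f} s")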

A final suggestion is to work with a smaller set of records until your code does what you need it to do. I have added a sentinel variable as an example, which breaks out of the loop after roughly 100 records have been parsed.
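
As an alternative to the sentinel counter, itertools.islice can cap the number of parsed records without touching the loop body; a small sketch with a placeholder file name:

from itertools import islice
from Bio import SeqIO

# Parse only the first 100 records while developing and testing.
with open("reads.fastq") as handle:
    for record in islice(SeqIO.parse(handle, "fastq"), 100):
        print(record.id)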

Solution 2:[2]

The things that I see on a first pass are these:

First, where you check

if "AACTGTAGGCACCATCAAT" in record.seq:
    adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')

you can use the following to avoid searching through the sequence twice.

adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
if adapter_pos != -1: # check that the adapter was found

The lines:

i = record.id
x = miRNAseq.seq+umi_seq.seq
#print((miRNAseq+umi_seq).format("fastq"))
dicts1[i]=x

can be changed to:

dicts1[record.id]=miRNAseq.seq+umi_seq.seq

Next, the lines:

for k,v in dicts1.items():
    if v not in dicts2.values():
        dicts2[k] = v

can be changed to

dicts2 = {**dicts1, **dicts2}

However, that change might actually come at a performance cost; I'm not sure.
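
Note that the original loop keeps only one id per unique sequence, and the linear v not in dicts2.values() scan is what makes it slow; a plain dict merge does not reproduce that value-level deduplication. A minimal sketch of the same idea with a set of already-seen sequences (the variable name seen is only illustrative):

seen = set()
dicts2 = {}
for k, v in dicts1.items():
    if v not in seen:   # O(1) membership test instead of scanning dicts2.values()
        seen.add(v)
        dicts2[k] = v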

Next, the loop

for keys in dicts2:
    lst.append(keys)

can be deleted and

lst = list(dicts2.keys())

can be added after the first pass through the file (where you have the commented-out print statements).

Finally, as @Roeften suggests, you can fold the second chunk of code into the first to avoid going through the whole file twice.

At least some of these suggestions will help, but in reality Python just isn't that fast a language, and if you want to do this sort of analysis regularly you might consider using something faster.

Solution 3:[3]

Building on your own answer, I would use the partition method here.

from Bio import SeqIO
import sys

def QIAseq_UMI_correction():
    script = sys.argv[0]
    file_name = sys.argv[1]
    dicts = {}

    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq, 'fastq'):

            # partition returns (before, separator, after); the separator is '' when the adapter is absent
            miRNA_str, adapter_str, rest = str(record.seq).partition("AACTGTAGGCACCATCAAT")

            if adapter_str:
                adapter_pos = len(miRNA_str)
                # slice the record itself so attributes like the Phred scores are kept
                miRNAseq = record[:adapter_pos]
                adapter_seq = record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                x = miRNA_str + rest[:12]          # miRNA + UMI string, used for deduplication
                y = miRNAseq + adapter_seq + umi_seq

                if x not in dicts:
                    dicts[x] = record.id

                    if len(y) <= 50:
                        print(y.format("fastq"), end='')
                    else:
                        cut = len(y) - 50
                        print(y[:-cut].format("fastq"), end='')

if __name__ == '__main__':
    QIAseq_UMI_correction()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Vovin