python3: Read a huge file (~800 GB), split the lines by condition, and append them to the end of new files in a memory-efficient way

I'm learning Python 3, and I'm dealing with a huge txt file (~800 GB).
The enclosed function 'kmers_dic' reads the main file and, whenever the condition in the if statement is satisfied, appends the line to one of the previously created files (there are 1024 of them, named after the contents of the kmers variable). The function works fine with a subset of the main file, but when I run the code on the full file, my job is killed because I hit the memory usage limit.

import gzip
import sys

def OpenFiles(i):
    '''
    A switch to handle file opening and reduce duplicated code
    '''
    open_method = {
        "gz": gzip.open,
        "norm": open
    }
    return open_method[i]

def rows(f, chunksize=102400, sep='\n'):
    """
    Read a file where the row separator is '\n' lazily.
    Default chunk size: 102400 characters (~100 kB).
    Usage:
    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '':  # End of file
            break
        while True:
            i = chunk.find(sep)
            if i == -1:
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            chunk = chunk[i+1:]
        curr_row += chunk

            
def kmers_dic(input_file, kmers, out_dir):
    '''
    file writing by kmers
    '''
    #kmers_dic = set()
    count_line = 0
    count_line_1 = 0
    if input_file.endswith('.gz'):
        nano_read = OpenFiles('gz')
    else:
        nano_read = OpenFiles('norm')

    with nano_read(input_file, 'rt') as nano_f:
        chunk = rows(nano_f, chunksize=2024, sep='\n')
        for line in chunk:

            count_line += 1
            count_line_1 += 1

            sys.stdout.write('%s\r' % count_line)
            sys.stdout.flush()

            line = line.strip('\n')
            line = line.split()
            if line[2] in kmers:
                kmer = line[2]
                Out_f_name = out_dir + line[2] + '.lib'
                file1 = open(Out_f_name, 'a')
                ##file1.write('\t'.join(line) + '\n') # print entire line
                file1.write('\t'.join(line[1:4:] + line[6:9:] + line[9:13:] + line[15:]) + '\n')
                file1.close()
    print("lines: ", count_line_1)

I don't understand where the issue is. Can you help me?
Thanks in advance!
Best.



Solution 1:[1]

curr_row += chunk causes you to keep accumulating chunks in memory until you run out of free memory: whenever a chunk contains no separator, nothing is yielded and curr_row just keeps growing.
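For comparison, the usual memory-safe pattern is to iterate over the file object itself: Python then reads and buffers one line at a time, so the footprint is bounded by the longest line rather than by the file size. The sketch below is not part of the original answer; split_by_kmer is a hypothetical name and the field slicing simply mirrors the question's code, assuming the input is plain newline-delimited text.

import gzip

def split_by_kmer(input_file, kmers, out_dir):
    # Iterate the file handle directly so that only one line is held
    # in memory at a time, regardless of the total file size.
    opener = gzip.open if input_file.endswith('.gz') else open
    count_line = 0
    with opener(input_file, 'rt') as nano_f:
        for line in nano_f:
            count_line += 1
            fields = line.split()
            if len(fields) > 2 and fields[2] in kmers:
                out_name = out_dir + fields[2] + '.lib'
                # Same column selection as in the question; append one record per match.
                record = '\t'.join(fields[1:4] + fields[6:9] + fields[9:13] + fields[15:])
                with open(out_name, 'a') as out_f:
                    out_f.write(record + '\n')
    print("lines:", count_line)

Opening each output file with a context manager per matching line keeps the sketch simple; caching the 1024 handles would avoid repeated open/close calls, but that is a speed concern, orthogonal to the memory issue.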

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: mojeto