'How to More Efficiently Parse a Large File in Python to Cut Down on Run Time?

I have a large text file (2,000,000+ lines) that I am attempting to read into a dataframe and then write to a csv file ultimately so it can be read into a database. Just wanted to state up front that I did not create the file and I cannot have the person who created the file modify it in any way, essentially I am stuck with the file as is and my goal is to parse it into a more usable formate. My code seems to work (I tested it on smaller subsegments of files) but when I attempt to process an entire file with over 2,000,000 rows, the program takes hours and hours to complete (like almost 20 hours!). I am trying to get some advice on how I could make this program run a lot faster as I have quite a few files to process.

Also, I am trying to improve my question writing so I have tried to include more detail about the problem, including snippets of code and an explanation of what I am trying to accomplish with each snippet. Please note that the code as it exists in full is listed at the end of my question as well as a sample file in full. The sample file has lines missing so the balances will not line up.

The current file is a list of account holders with their name and address, followed by accounts that they have and the transactions that occurred on the account over time. Once all the transactions have been listed for all the accounts a person has, the file then moves on to a new person and again lists all of their information (Name, address) and then all the transactions for their accounts. For each new account, there is a new header line that is either DATE DESCRIPTION DEPOSIT WITHDRAWAL BALANCE for a deposit type of account or DATE DESCRIPTION PRINCIPAL INTEREST BALANCE for a Loan type of account. Within the list of transactions, there are some transactions that are actual deposit or withdrawals or loan repayment transactions. There are also some BALANCE FORWARD transactions which denote the starting balance for an account at the beginning of the period. These balance forward transactions signify switching to another account for the same person. Lastly, there are some transactions that don’t have a transaction amount and are essentially just a continuation of a description from the last transaction on the previous line.

The following are some examples of what the file looks like and some accompanying code that I use to parse these pieces of information:

Header to start off file:

ABC COMPANY                                                     
JAN 01 2019 TO DEC 31 2019                                                      
000000001                                                                      
                      ABC COMPANY                               
                                                                                
                                                                                
      JOHN SMITH                                                          
      176 HUGH CRES                                                         
      NOWHERE ON                                                                 
      Z9Z 9Z9

I generally ignore the first two lines of the file, the 3rd line is an account number, which I store in prev_line. The 4th line is a header of ABC COMPANY which denotes that there is an account holder's name and address that will follow after 2 blank lines. There may be multiple people on the account, so my code just reads in the first lines of this section as the primary name on the account and the rest is just grouped into “Other Information”, the following are the columns of the dataframe (df) and how I read in a person’s info into this dataframe for the acct_num, mem_name and other_mem_info columns:

df = pd.DataFrame(columns=['acct_num', 'mem_name', 'other_mem_info', 'acct_name', 'acct_type', 'trxn_date', \
                          'trxn_description', 'deposit', 'withdrawal', 'principal', 'interest', 'balance'])


if i > 1:
   prev_line = line.strip()
line = next(file)

# New person header
if line.strip() == new_person_header and i > 2:
   # Skip 2 blank lines after header
   blank = next(file)
   blank = next(file)
   i += 2

   # Account Number stored in prev_line, followed by name and remaining other info
   person_info.append(prev_line)
   name = next(file).strip()
   i += 1
   person_info.append(name)

   #Remaining other info
   line = next(file)
   i += 1
   while line.strip() != "":
           person_info.append(line.strip())
           line = next(file)
           i += 1

I check to see if the person I read in is the same as the last person I read in (therefore meaning that the information that follows is more account transactions for the same person). I clear the person_info variable after each iteration so that it can store the new person info on the next iteration:

if person_info != prev_person_info:  # Therefore not the same as past account holder, so add new account holder
       #Write to df new person
       df.loc[df_i] = [person_info[0], person_info[1], ' '.join(person_info[2:]), "", "", "", "", "", "",
                       "", "", ""]
       prev_person_info = person_info.copy()
       person_info.clear()
   else: #Therefore not a new person
       person_info.clear()

There are two transaction headers, one for Loans and one for deposits, in both cases the transaction header is followed by dashes:

  DATE    DESCRIPTION            DEPOSIT     WITHDRAWAL         BALANCE        
 -----------------------------------------------------------------------  

Or

 DATE    DESCRIPTION          PRINCIPAL       INTEREST         BALANCE        
 ----------------------------------------------------------------------- 

When my code comes across a line in the file with all dashes, I use the prev_line variable to determine if this is a Deposit or Loan account:

#Setup to define what type of transaction - Loan or Deposit
elif line.strip() == new_trxn_header:
   if prev_line == deposit_trxn:
       loan_dep = "DEP"
   elif prev_line == loan_trxn:
       loan_dep = "LOAN"
   else:
       loan_dep = "ISSUE WITH TRANSACTION TYPE"
       print(loan_dep+line.strip() + " Line #" + str(i))
       exit()

In terms of transactions, typically the first transaction for a new account is the balance forward line which describes the account name:

JAN 01 19 BALANCE FORWARD: MY ACCOUNT TYPE                      3,135.32

It seems that this account name can be any length but is generally proceeded by BALANCE FORWARD and ends with the balance, so I split the string to get the name of the account and I also extract the date using a regex, I then write to the df the account info, storing the account name and opening balance. I have a counter called df_i which I increment after each time I add a new line to the dataframe:

elif balance_fwd_trxn in line and \
               len(re.findall(r"(?:^JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}",line.strip())) > 0:
   #Therefore a Balance Forward Transaction - will want to put name of account and write to DF

   #Parse line to get date, account number and balance
   trxn_date = re.findall(r"(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}", line)[0] #Pick first element in list in case description has a date too
   balance_values = re.findall(r"\d*\.\d+", re.sub(",", "", line))
   querywords = line.split(balance_fwd_trxn, 1)[1].replace(",", "").split()
   resultwords = [word for word in querywords if word.lower() not in balance_values]
   account_name = ' '.join(resultwords)
   if len(balance_values) > 0:
       balance = format(float(balance_values[-1]), '.2f')
   else:
       balance = ""



   #Write Account Info
   df.at[df_i, "acct_name"], df.at[df_i, "trxn_date"], df.at[df_i, "trxn_description"], df.loc[df_i]["balance"], \
   df.loc[df_i]["acct_type"] = account_name, trxn_date, "BALANCE FORWARD", balance, loan_dep

   #Fill in previous row account details when we first created the person
   if (pd.isnull(df.at[df_i, 'acct_num'])):
       df.at[df_i, 'acct_num'], df.at[df_i, 'mem_name'], df.at[df_i, 'other_mem_info'] = df.at[df_i - 1, 'acct_num'],df.at[df_i - 1, 'mem_name'], df.at[df_i-1, 'other_mem_info']

   #Increment df_i for next row in df
   df_i += 1

The last thing that gets check is if the line in the file is an actual transaction. I then have a parsing function to handle these actual transactions:

elif len(re.findall(r"(?:^JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}",line.strip())) > 0 and \
       balance_fwd_trxn  not in line and i > 2:
   append_description, trxn_date, new_description, deposit, \
   withdrawal, interest, principal, balance = parse_trx(line, loan_dep, i)

   if(append_description) : #Therefore not a transaction, just a continuation of previous transaction and need to append description
       df.at[df_i-1, "trxn_description"] = df.at[df_i-1, "trxn_description"].strip() + " " + new_description
   else: #Therefore real transaction
       df.at[df_i, "trxn_date"], df.at[df_i, "trxn_description"], df.at[df_i, "deposit"], df.at[df_i, "withdrawal"], \
       df.at[df_i, "interest"], df.at[df_i, "principal"], df.loc[df_i]["balance"], df.loc[df_i]["acct_type"] = \
               trxn_date, new_description.strip(), deposit, withdrawal, interest, principal, balance, loan_dep


       if (pd.isnull(df.at[df_i, 'acct_num'])):
           df.at[df_i, 'acct_num'], df.at[df_i, 'mem_name'], df.at[df_i, 'other_mem_info'], df.at[df_i, 'acct_name']  \
               = df.at[df_i - 1, 'acct_num'], df.at[df_i - 1, 'mem_name'], df.at[df_i - 1, 'other_mem_info'], \
                 df.at[df_i - 1, 'acct_name']


       # Increment df_i for next row in df
       df_i += 1

Parsing Transaction Function:

This function returns all the parsed transactional information for a regular transaction (append_description = False) and a transaction that is just a continuation of a description (append_description = True).

def parse_trx(line_string, account_type, i):
   append_description = False
   new_description =deposit = withdrawal = interest = principal = balance = trxn_date = ""
   if(len(re.findall(r"(?:^JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}",line.strip())) == 0):
       print("Not a proper transaction, line: " + str(i) + " String: " + line_string.strip())
       exit()
   else:
       trxn_date = re.findall(r"(?:^JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}",line.strip())[0]

   if line_string[69] != ".": #Therefore no balance and is a description which be appended
       append_description = True
       new_description = line_string.replace(trxn_date, "").strip()



   else: #Therefore has a balance and is a real transaction
       dollar_values = re.findall("\d*\.\d+", re.sub(",", "", line_string))
       new_description = line_string.replace(trxn_date, "").strip()
       new_description = new_description.replace("-", "").strip()
       new_description = new_description.replace(",", "")
       for amount in dollar_values:
           new_description = new_description.replace(amount, "").strip()
       new_description = ' '.join(new_description.split())
       balance = format(float(dollar_values[-1]), '.2f')


       if(account_type == "DEP" and len(dollar_values) == 2):
           # Determine if Withdrawal or Deposit
           if (line_string[38] == "."): #Therefore Deposit
               deposit = format(float(dollar_values[0]), '.2f')
               withdrawal = ""
               interest = ""
               principal = ""

           elif (line_string[53] == "."): #Therefore Withdrawal
               deposit = interest = principal = ""
               withdrawal = format(float(dollar_values[0]), '.2f')
           else:
               print("Issue with transaction - not proper Deposit type - on line: " + str(i) + " String: " + line_string.strip())
               exit()
       elif(account_type == "LOAN" and len(dollar_values) == 3):
           deposit = withdrawal = ""
           interest, principal = format(float(dollar_values[0]), '.2f'), format(float(dollar_values[1]), '.2f')
       else:
           print("Issue with transaction - not proper type - on line: " + str(i) + " String: " + line_string.strip() + " type: " + account_type)
           exit()

   return append_description, trxn_date, new_description, deposit, withdrawal, interest, principal, balance

The following is the code in full:

import re
import pandas as pd
import time
import datetime as dt

def parse_trx(line_string, account_type, i):
   append_description = False
   new_description =deposit = withdrawal = interest = principal = balance = trxn_date = ""
   if(len(re.findall(r"(?:^JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}",line.strip())) == 0):
       print("Not a proper transaction, line: " + str(i) + " String: " + line_string.strip())
       exit()
   else:
       trxn_date = re.findall(r"(?:^JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}",line.strip())[0]

   if line_string[69] != ".": #Therefore no balance and is a description which be appended
       append_description = True
       new_description = line_string.replace(trxn_date, "").strip()



   else: #Therefore has a balance and is a real transaction
       dollar_values = re.findall("\d*\.\d+", re.sub(",", "", line_string))
       new_description = line_string.replace(trxn_date, "").strip()
       new_description = new_description.replace("-", "").strip()
       new_description = new_description.replace(",", "")
       for amount in dollar_values:
           new_description = new_description.replace(amount, "").strip()
       new_description = ' '.join(new_description.split())
       balance = format(float(dollar_values[-1]), '.2f')


       if(account_type == "DEP" and len(dollar_values) == 2):
           # Determine if Withdrawal or Deposit
           if (line_string[38] == "."): #Therefore Deposit
               deposit = format(float(dollar_values[0]), '.2f')
               withdrawal = ""
               interest = ""
               principal = ""

           elif (line_string[53] == "."): #Therefore Withdrawal
               deposit = interest = principal = ""
               withdrawal = format(float(dollar_values[0]), '.2f')
           else:
               print("Issue with transaction - not proper Deposit type - on line: " + str(i) + " String: " + line_string.strip())
               exit()
       elif(account_type == "LOAN" and len(dollar_values) == 3):
           deposit = withdrawal = ""
           interest, principal = format(float(dollar_values[0]), '.2f'), format(float(dollar_values[1]), '.2f')
       else:
           print("Issue with transaction - not proper type - on line: " + str(i) + " String: " + line_string.strip() + " type: " + account_type)
           exit()

   return append_description, trxn_date, new_description, deposit, withdrawal, interest, principal, balance

new_person_header = "ABC COMPANY"
new_trxn_header = "-----------------------------------------------------------------------"
loan_trxn, deposit_trxn, balance_fwd_trxn = "DATE    DESCRIPTION          PRINCIPAL       INTEREST         BALANCE", "DATE    DESCRIPTION            DEPOSIT     WITHDRAWAL         BALANCE", "BALANCE FORWARD:"

df = pd.DataFrame(columns=['acct_num', 'mem_name', 'other_mem_info', 'acct_name', 'acct_type', 'trxn_date', \
                          'trxn_description', 'deposit', 'withdrawal', 'principal', 'interest', 'balance'])
loan_dep = ""

with open('somefile.txt', "r") as file:
   N = len(file.readlines())
   file.seek(0,0)
   i = 1
   df_i = 0
   num_members = 0
   person_info = []
   prev_person_info = []

   person_info.clear()
   prev_person_info.clear()
   start = time.time()



   while i < N:
    # Display progress to screen
       if i%10000 == 0:
           end = time.time()
           print("Lines Parsed: " + str(i) + " percent complete " + str(i*100/N) + \
                 "% Elapsed time " + str((end - start) // 60) + " min " + str((end-start) - 60*((end - start) // 60)) + " sec")

       if i > 1:
           prev_line = line.strip()
       line = next(file)

       # New person header
       if line.strip() == new_person_header and i > 2:
           # Skip 2 blank lines after header
           blank = next(file)
           blank = next(file)
           i += 2

           # Member number stored in prev_line, followed by name and remaining other info
           person_info.append(prev_line)
           name = next(file).strip()
           i += 1
           person_info.append(name)

           #Remaining other info
           line = next(file)
           i += 1
           while line.strip() != "":
                   person_info.append(line.strip())
                   line = next(file)
                   i += 1

           if person_info != prev_person_info:  # Therefore not the same as past member, so add new member
               #Write to df new person
               df.loc[df_i] = [person_info[0], person_info[1], ' '.join(person_info[2:]), "", "", "", "", "", "",
                               "", "", ""]
               prev_person_info = person_info.copy()
               person_info.clear()
           else: #Therefore not a new person
               person_info.clear()
       #Setup to define what type of transaction - Loan or Deposit
       elif line.strip() == new_trxn_header:
           if prev_line == deposit_trxn:
               loan_dep = "DEP"
           elif prev_line == loan_trxn:
               loan_dep = "LOAN"
           else:
               loan_dep = "ISSUE WITH TRANSACTION TYPE"
               print(loan_dep+line.strip() + " Line #" + str(i))
               exit()
       elif balance_fwd_trxn in line and \
                       len(re.findall(r"(?:^JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}",line.strip())) > 0:
           #Therefore a Balance Forward Transaction - will want to put name of account and write to DF

           #Parse line to get date, account number and balance
           trxn_date = re.findall(r"(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}", line)[0] #Pick first element in list in case description has a date too
           balance_values = re.findall(r"\d*\.\d+", re.sub(",", "", line))
           querywords = line.split(balance_fwd_trxn, 1)[1].replace(",", "").split()

           resultwords = [word for word in querywords if word.lower() not in balance_values]
           account_name = ' '.join(resultwords)
           if len(balance_values) > 0:
               balance = format(float(balance_values[-1]), '.2f')
           else:
               balance = ""



           #Write Account Info
           df.at[df_i, "acct_name"], df.at[df_i, "trxn_date"], df.at[df_i, "trxn_description"], df.loc[df_i]["balance"], \
           df.loc[df_i]["acct_type"] = account_name, trxn_date, "BALANCE FORWARD", balance, loan_dep

           #Fill in previous row account details when we first created the person
           if (pd.isnull(df.at[df_i, 'acct_num'])):
               df.at[df_i, 'acct_num'], df.at[df_i, 'mem_name'], df.at[df_i, 'other_mem_info'] = df.at[df_i - 1, 'acct_num'],df.at[df_i - 1, 'mem_name'], df.at[df_i-1, 'other_mem_info']

           #Increment df_i for next row in df
           df_i += 1

       elif len(re.findall(r"(?:^JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2} \d{2}",line.strip())) > 0 and \
               balance_fwd_trxn  not in line and i > 2:
           append_description, trxn_date, new_description, deposit, \
           withdrawal, interest, principal, balance = parse_trx(line, loan_dep, i)

           if(append_description) : #Therefore not a transaction, just a continuation of previous transaction and need to append description
               df.at[df_i-1, "trxn_description"] = df.at[df_i-1, "trxn_description"].strip() + " " + new_description
           else: #Therefore real transaction
               df.at[df_i, "trxn_date"], df.at[df_i, "trxn_description"], df.at[df_i, "deposit"], df.at[df_i, "withdrawal"], \
               df.at[df_i, "interest"], df.at[df_i, "principal"], df.loc[df_i]["balance"], df.loc[df_i]["acct_type"] = \
                       trxn_date, new_description.strip(), deposit, withdrawal, interest, principal, balance, loan_dep


               if (pd.isnull(df.at[df_i, 'acct_num'])):
                   df.at[df_i, 'acct_num'], df.at[df_i, 'mem_name'], df.at[df_i, 'other_mem_info'], df.at[df_i, 'acct_name']  \
                       = df.at[df_i - 1, 'acct_num'], df.at[df_i - 1, 'mem_name'], df.at[df_i - 1, 'other_mem_info'], \
                         df.at[df_i - 1, 'acct_name']


               # Increment df_i for next row in df
               df_i += 1

       i += 1

# Clean up dataframe and write to file
df = df.fillna('')
print(df.to_string())
df.to_csv('somefile_write.csv',index=False)
# Display total elapsed time
print("Lines Parsed: " + str(i) + " percent complete " + str(i*100/N) + \
                 "% Elapsed time " + str((end - start) // 60) + " min " + str((end-start) - 60*((end - start) // 60)) + " sec")

Sample File:

ABC COMPANY                                                     
JAN 01 2019 TO DEC 31 2019                                                      
000000001                                                                      
                      ABC COMPANY                               
                                                                                
                                                                                
      JOHN SMITH                                                          
      176 HUGH CRES                                                         
      NOWHERE ON                                                                 
      Z9Z 9Z9


                                 ACC: 000000001              PAGE:    1        
                                                                                
   DATE    DESCRIPTION            DEPOSIT     WITHDRAWAL         BALANCE        
 -----------------------------------------------------------------------        
                                                                                
 JAN 01 19 BALANCE FORWARD: ACCOUNT #1                            222.13        
 JAN 31 19 DIVIDEND                   .63                         222.76        
 AUG 20 19 DEPOSIT                  20.00                         242.76        
                                                                        
   DATE    DESCRIPTION            DEPOSIT     WITHDRAWAL         BALANCE        
 -----------------------------------------------------------------------        
                                                                                
 JAN 01 19 BALANCE FORWARD: ACCOUNT #2                          1,695.04        
 JAN 31 19 INTEREST                  2.16                       1,697.20                       
000000009                                                                      
                      ABC COMPANY  

      PERSON NUMBERONE                                                           
      PERSON NUMBERTWO                                                 
      1212-250 SOMEROAD RD                                                      
      SOMEPLACE AB                                                                
      Z9Z 9Z9                                                                   
                                                                                
                                                                                
                                  ACC: 000000009              PAGE:    1        
                                                                                
   DATE    DESCRIPTION            DEPOSIT     WITHDRAWAL         BALANCE        
 -----------------------------------------------------------------------        
                                                                                
 JAN 01 19 BALANCE FORWARD: ACCOUNT #1                          1,006.30               
 JUL 17 19 E/C RE JULY16/19                                                     
 JUL 31 19 INTEREST                   .18                       1,006.85               
 DEC 02 19 TO ACC # 4736                                                        
                                                                                
   DATE    DESCRIPTION            DEPOSIT     WITHDRAWAL         BALANCE        
 -----------------------------------------------------------------------        
                                                                                
 JAN 01 19 BALANCE FORWARD: ACCOUNT #2                            135.91        
 JAN 31 19 INTEREST                   .17                         136.08        
 FEB 28 19 INTEREST                   .16                         136.24                                                                                                     
000000053                                                                      
                      ABC COMPANY                               
                                                                                
                                                                                
      PERSON NUMBERTHREEE                                                              
      PERSON NUMBERFOUR                                                 
      28 SOMECOURT CRT                                                           
      SOMECITY QC                                                                
      Z9Z 9Z9                                                                   
                                                                                
                                                                                
                                  ACC: 000000053              PAGE:    5        
                                                                                

   DATE    DESCRIPTION          PRINCIPAL       INTEREST         BALANCE        
 -----------------------------------------------------------------------        
                                                                                
 JAN 01 19 BALANCE FORWARD: ACCOUNT #5                          2,538.96                
 JAN 01 19 LOAN PAYMENT            840.48-          8.52-       1,698.48                           


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source