'How to parse out a dataframe from .csv file. which contains header detail rows using Python

I have a file I am trying to extract values from to create a data frame. I have tried a regex approach to create lists from the file, but data format(Header/H and Detail/D) as below is giving me inconsistent row counts when I input the resulting lists into a data frame. I think the issue is that some records have 1 detail (D) row while others have more than 1 (D) row. Could you suggest another approach? I was thinking of trying to create a dictionary object where each H row would be the key and each D row would be the value, using a for loop of some kind.

The file format is as below:

H, INV34801, 20200201, 09:18:55, IN, 5 D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,
H, INV34802, 20200201, 10:12:35, IN, 5 D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,
D, INV34802, , 1.00, IN, 20200201, 10:12:35,

This the code I have been trying:

    import pandas as pd
    import re
    import itertools
    #First I extract the date that each sale took place. 
    lst1= [line for line in re.findall(('[IN, ]\d\d\d\d\d\d\d\d'), contents)]
    #Now I remove every alternate date to remove the duplicate date I can confirm seeing that the                                 
    #date column has the same number of rows as the Invoice Number column
    lst1=lst1[1::2]
    #Now I extract the invoice number
    lst2= [line for line in re.findall("INV\w*",contents)]
    # Now I extract the product codes
    lst3=[line for line in         
    re.findall(('\s\s\s\s\s\w\w\w\w\w\w\w\w\w\w\w|\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s|\n
    \s\s\s\s\s\s\s\s\s\s\s\s\d\d\d\d|\s\s\s\s\s\s\s\s\s\s\d\d\d\d\d\d'),contents)]
    #Now I extract the Quantity Sold 
    lst4=[line for line in re.findall(('\s\s\s\s\s\s\d\.\d\d'),contents)]
    #then I create a column from the list of Invoice numbers
    df=pd.DataFrame([lst1,lst2,lst3,lst4])
    df =df.transpose()
    df.columns=['Date','Invoice_Number','Product_Code','Quantity']
    print(df)
    ''' 

The output structure I get is correct but the quantity and product codes arent lined up to the correct invoice numbers.

Dataframe below:

    Date Invoice_Number      Product_Code    Quantity
    0      20200201       INV34801                          1.00
    1      20200201       INV34802                          1.00
    2      20200201       INV34803                          1.00
    3      20200201       INV34804                          1.00
    4      20200201       INV34805                          8.00

I'd appreciate your kind advice.



Solution 1:[1]

Try this:

regex = r"[H,D] (?P<invoice_nr>[^,]*)(, (?P<date>[^,]*)[\s\S]*?(?P<quantity>-?\d+\.00), IN)[\s\S]*?(\n|$)"

I have made you an example here, not sure if this helps, but it should give you some pointers on regex

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1