'How to parse out a dataframe from .csv file. which contains header detail rows using Python
I have a file I am trying to extract values from to create a data frame. I have tried a regex approach to create lists from the file, but data format(Header/H and Detail/D) as below is giving me inconsistent row counts when I input the resulting lists into a data frame. I think the issue is that some records have 1 detail (D) row while others have more than 1 (D) row. Could you suggest another approach? I was thinking of trying to create a dictionary object where each H row would be the key and each D row would be the value, using a for loop of some kind.
The file format is as below:
H, INV34801, 20200201, 09:18:55, IN, 5
D, INV34801, 0053, 1.00, IN, 20200201, 09:18:55,
H, INV34802, 20200201, 10:12:35, IN, 5
D, INV34802, D22345433DU, -1.00, IN, 20200201, 10:12:35,
D, INV34802, , 1.00, IN, 20200201, 10:12:35,
This the code I have been trying:
import pandas as pd
import re
import itertools
#First I extract the date that each sale took place.
lst1= [line for line in re.findall(('[IN, ]\d\d\d\d\d\d\d\d'), contents)]
#Now I remove every alternate date to remove the duplicate date I can confirm seeing that the
#date column has the same number of rows as the Invoice Number column
lst1=lst1[1::2]
#Now I extract the invoice number
lst2= [line for line in re.findall("INV\w*",contents)]
# Now I extract the product codes
lst3=[line for line in
re.findall(('\s\s\s\s\s\w\w\w\w\w\w\w\w\w\w\w|\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s|\n
\s\s\s\s\s\s\s\s\s\s\s\s\d\d\d\d|\s\s\s\s\s\s\s\s\s\s\d\d\d\d\d\d'),contents)]
#Now I extract the Quantity Sold
lst4=[line for line in re.findall(('\s\s\s\s\s\s\d\.\d\d'),contents)]
#then I create a column from the list of Invoice numbers
df=pd.DataFrame([lst1,lst2,lst3,lst4])
df =df.transpose()
df.columns=['Date','Invoice_Number','Product_Code','Quantity']
print(df)
'''
The output structure I get is correct but the quantity and product codes arent lined up to the correct invoice numbers.
Dataframe below:
Date Invoice_Number Product_Code Quantity
0 20200201 INV34801 1.00
1 20200201 INV34802 1.00
2 20200201 INV34803 1.00
3 20200201 INV34804 1.00
4 20200201 INV34805 8.00
I'd appreciate your kind advice.
Solution 1:[1]
Try this:
regex = r"[H,D] (?P<invoice_nr>[^,]*)(, (?P<date>[^,]*)[\s\S]*?(?P<quantity>-?\d+\.00), IN)[\s\S]*?(\n|$)"
I have made you an example here, not sure if this helps, but it should give you some pointers on regex
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
