quotechar is not functioning in pandas read_fwf
I have commas inside pandas dataframe column values, as shown below. I have tried all the parameters provided by the pandas read_fwf and read_csv methods, but nothing seems to work.
Note: I'm reading a txt.gz file and I cannot use StringIO
Sample Input:
"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1
"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2
"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7
My code
pd.read_fwf(input_filepath, compression='gzip', sep=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, header=None, nrows=1000, usecols=[0, 1, 7])
Expected Output
0 1 2 3 4 5 6
0 NTE| NaN| NaN| NANTONG JIACHENG GARMENTS CO. LTD | P| 0| 1
1 LEH| NaN| NaN| WUHAN YIZHAO TRADING CO. LTD. | P| 0| 2
2 ARN| NaN| NaN| Clinical Diagnostic Solutions Inc.| P| 0| 7
A sample data file is attached: data_file
I only need usecols=[0, 1, 7] for the processing.
I'm looking for a pandas way of resolving this instead of using regex, as there are a lot of such commas in my data. Please help me with this issue.
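For reference, `read_csv` (unlike `read_fwf`) does honor `quotechar`, even when reading compressed input directly. A minimal sketch, using an in-memory gzip buffer purely to stand in for the real txt.gz file:

```python
import gzip
import io
import pandas as pd

# Two of the sample rows, gzip-compressed in memory to mimic a .txt.gz file.
data = (b'"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1\n'
        b'"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2\n')
buf = io.BytesIO(gzip.compress(data))

# read_csv applies quotechar, so the commas inside the quoted
# company names stay within a single field.
df = pd.read_csv(buf, sep=',', quotechar='"', header=None, compression='gzip')
print(df)
```

Whether this matches the real file depends on its actual layout; the sketch only covers the sample rows shown above.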
Solution 1:[1]
After an extensive chat:
import io
import re
import pandas as pd

# Working on the gunzipped and untarred txt file
f = 'RequestoEAP_20220220_test.txt'
s1 = list()
s2 = list()
re_brackets = re.compile(r'\((.*)\)')  # raw string so \( and \) reach the regex engine
with open(f, 'r') as fh:
    for l in fh:
        s1.append(l[:23])          # fixed-width timestamp prefix
        if (m := re_brackets.search(l)):
            s2.append(m[1])        # CSV payload inside the parentheses
        else:
            s2.append('')
df1 = pd.read_csv(io.StringIO('\n'.join(s1)), sep=' ', header=None)
df2 = pd.read_csv(io.StringIO('\n'.join(s2)), sep=',', header=None)
df = pd.concat((df1, df2), axis=1)
print(df)
Output:
0 1 0 1 2 3 4 5 6 ... 24 25 26 27 28 29 30 31 32
0 2022-02-20 00:00:10.061 6016293021 JKT ID AP SUB ID AP ... N 0 1 FOCIDIGTW NaN NaN jms:WebSphere_MQ-default-sender)................. main True
1 2022-02-20 00:00:10.061 8910112455 BJX MX AM VSA MX AM ... 5 0 1 980446133 NaN NaN jms:WebSphere_MQ-default-sender)................. main True
2 2022-02-20 00:00:10.061 9640651705 NLU MX AM MTY MX AM ... G 0 1 988122138 NaN NaN jms:WebSphere_MQ-default-sender)................. main True
3 2022-02-20 00:00:10.061 9410678701 JKT ID AP SRG ID AP ... N 0 1 FOCIDIGTW NaN NaN jms:WebSphere_MQ-default-sender)................. main True
4 2022-02-20 00:00:10.061 7120027014 BOM IN AP CPH DK EU ... D 0 1 530852429 NaN NaN jms:WebSphere_MQ-default-sender)................. main True
5 2022-02-20 00:00:10.062 9473172225 LCY GB EU ZLS GB EU ... N 0 1 135104716 NaN NaN jms:WebSphere_MQ-default-sender)................. main True
6 2021-11-09 00:00:00.200 5988409265 PVG CN CN NTE FR EU ... P 0 1 969762616 NaN NTE jms:WebSphere_MQ-default-sender NaN NaN
7 2021-11-09 00:00:00.202 9876963305 SZX CN CN LEH FR EU ... P 0 2 606734345 NaN NaN jms:WebSphere_MQ-default-sender NaN NaN
8 2021-11-09 00:00:00.292 5697446005 TMB US AM ARN SE EU ... P 0 7 962003035 NaN STO jms:WebSphere_MQ-default-sender NaN NaN
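The core of this approach is splitting each log line into a 23-character timestamp prefix and the parenthesised CSV payload. A minimal sketch of that split on a made-up line (the line layout is an assumption mirroring the slicing above):

```python
import re

# Hypothetical log line: 23-character timestamp, then a parenthesised CSV payload.
line = '2022-02-20 00:00:10.061 (A,B,"C, with comma")'
re_brackets = re.compile(r'\((.*)\)')

prefix = line[:23]              # fixed-width timestamp slice
m = re_brackets.search(line)
payload = m[1] if m else ''     # everything between the parentheses
print(prefix)
print(payload)
```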
Before the extensive chat:
import io
s = """"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1
"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2
"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7
"""
pd.read_csv(io.StringIO(s), sep=',', header=None)
You had the wrong reader: read_fwf doesn't apply quoting, while read_csv does. The compression argument also isn't needed when parsing an in-memory string. Output:
0 1 2 3 4 5 6
0 NTE NaN NaN NANTONG JIACHENG GARMENTS CO., LTD P 0 1
1 LEH NaN NaN WUHAN YIZHAO TRADING CO. , LTD. P 0 2
2 ARN NaN NaN Clinical Diagnostic Solutions, Inc. P 0 7
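If only a subset of columns is needed, as in the question, usecols can be added to the same call. The indices below are illustrative for this 7-column sample (the question's [0, 1, 7] presumably matches the wider real file):

```python
import io
import pandas as pd

s = ('"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1\n'
     '"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7\n')

# The default quotechar='"' keeps the quoted commas inside their fields;
# usecols then trims the frame to the columns of interest.
df = pd.read_csv(io.StringIO(s), sep=',', header=None, usecols=[0, 4, 6])
print(df)
```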
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
