quotechar is not functioning in pandas read_fwf

I have commas inside pandas DataFrame column values, as shown below. I have tried all the parameters provided by the pandas read_fwf and read_csv methods, but nothing seems to work.

Note: I'm reading a txt.gz file and I cannot use StringIO.

Sample Input:

"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1
"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2
"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7

My code

pd.read_fwf(input_filepath, compression='gzip', sep=',', quoting=csv.QUOTE_ALL,
            skipinitialspace=True, header=None, nrows=1000, usecols=[0,1,7])

Expected Output

     0    1    2                                    3  4  5  6
0  NTE  NaN  NaN   NANTONG JIACHENG GARMENTS CO. LTD  P  0  1
1  LEH  NaN  NaN       WUHAN YIZHAO TRADING CO. LTD.  P  0  2
2  ARN  NaN  NaN  Clinical Diagnostic Solutions Inc.  P  0  7

Sample data file added: data_file

I only need usecols=[0,1,7] for the processing.

I'm looking for a pandas way of resolving this instead of using regex, as there are a lot of such commas in my data. Please help me with this issue.
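For context, read_fwf is meant for fixed-width files and does not honour sep, quoting, or quotechar; read_csv does, even when reading straight from a gzipped file. A minimal sketch, assuming a hypothetical sample.txt.gz built from the sample input above:

```python
import gzip
import pandas as pd

# Build a gzipped file from the question's sample input (hypothetical path).
sample = (
    '"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1\n'
    '"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2\n'
    '"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7\n'
)
path = "sample.txt.gz"
with gzip.open(path, "wt") as fh:
    fh.write(sample)

# read_csv honours quotechar, so commas inside quoted fields stay in the field.
df = pd.read_csv(path, sep=",", quotechar='"', header=None, compression="gzip")
print(df.shape)  # (3, 7)
```

Note that the empty quoted strings ("") come back as NaN by default, matching the expected output above.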



Solution 1:[1]

After an extensive chat:

import io
import re
import pandas as pd

# Working on the gunzipped and untarred txt file
path = 'RequestoEAP_20220220_test.txt'
s1 = list()
s2 = list()

re_brackets = re.compile(r'\((.*)\)')  # raw string avoids the invalid-escape warning

with open(path, 'r') as fh:
    for line in fh:
        s1.append(line[:23])                 # fixed-width timestamp prefix
        if (m := re_brackets.search(line)):  # walrus operator requires Python 3.8+
            s2.append(m[1])
        else:
            s2.append('')

df1 = pd.read_csv(io.StringIO('\n'.join(s1)), sep=' ', header=None)
df2 = pd.read_csv(io.StringIO('\n'.join(s2)), sep=',', header=None)

df = pd.concat((df1, df2), axis=1)
print(df)

Output:

           0             1           0    1   2   3    4   5   6   ... 24  25  26         27  28   29                                                 30    31    32
0  2022-02-20  00:00:10.061  6016293021  JKT  ID  AP  SUB  ID  AP  ...  N   0   1  FOCIDIGTW NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
1  2022-02-20  00:00:10.061  8910112455  BJX  MX  AM  VSA  MX  AM  ...  5   0   1  980446133 NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
2  2022-02-20  00:00:10.061  9640651705  NLU  MX  AM  MTY  MX  AM  ...  G   0   1  988122138 NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
3  2022-02-20  00:00:10.061  9410678701  JKT  ID  AP  SRG  ID  AP  ...  N   0   1  FOCIDIGTW NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
4  2022-02-20  00:00:10.061  7120027014  BOM  IN  AP  CPH  DK  EU  ...  D   0   1  530852429 NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
5  2022-02-20  00:00:10.062  9473172225  LCY  GB  EU  ZLS  GB  EU  ...  N   0   1  135104716 NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
6  2021-11-09  00:00:00.200  5988409265  PVG  CN  CN  NTE  FR  EU  ...  P   0   1  969762616 NaN  NTE                    jms:WebSphere_MQ-default-sender   NaN   NaN
7  2021-11-09  00:00:00.202  9876963305  SZX  CN  CN  LEH  FR  EU  ...  P   0   2  606734345 NaN  NaN                    jms:WebSphere_MQ-default-sender   NaN   NaN
8  2021-11-09  00:00:00.292  5697446005  TMB  US  AM  ARN  SE  EU  ...  P   0   7  962003035 NaN  STO                    jms:WebSphere_MQ-default-sender   NaN   NaN

Before the extensive chat:

import io
import pandas as pd

s = """"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1
"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2
"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7
"""
pd.read_csv(io.StringIO(s), sep=',', header=None)

You had the wrong separator and specified compression, neither of which seems to be necessary. Output:

     0   1   2                                    3  4  5  6
0  NTE NaN NaN   NANTONG JIACHENG GARMENTS CO., LTD  P  0  1
1  LEH NaN NaN      WUHAN YIZHAO TRADING CO. , LTD.  P  0  2
2  ARN NaN NaN  Clinical Diagnostic Solutions, Inc.  P  0  7
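The expected output in the question also strips the commas inside the company names. If that is wanted, a vectorised string replace on the parsed column does it after the read, rather than any regex over the raw file. A sketch, assuming the read_csv result above:

```python
import io
import pandas as pd

s = '"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1\n'
df = pd.read_csv(io.StringIO(s), sep=",", header=None)

# Remove the commas inside the already-parsed field to match the expected output.
df[3] = df[3].str.replace(",", "", regex=False)
print(df.at[0, 3])  # NANTONG JIACHENG GARMENTS CO. LTD
```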

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
