'How can I parse CSV files with quoted fields containing commas, in awk?
I have a big CSV field, and I use awk with the field separator set to a comma. However, some fields are quoted and contain a comma, and I'm facing this issue:
Original file:
Downloads $ cat testfile.csv
"aaa","bbb","ccc","dddd"
"aaa","bbb","ccc","d,dd,d"
"aaa","bbb","ccc","dd,d,d"
I am trying this way:
Downloads $ cat testfile.csv | awk -F "," '{ print $2","$3","$4 }'
"bbb","ccc","dddd"
"bbb","ccc","d
"bbb","ccc","dd
Expecting result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Solution 1:[1]
I would use a tool that is able to properly parse CSV, such as xsv. With it, the command would look like
$ xsv select 2-4 testfile.csv
bbb,ccc,dddd
bbb,ccc,"d,dd,d"
bbb,ccc,"dd,d,d"
or, if you really want every value quoted, with a second step:
$ xsv select 2-4 testfile.csv | xsv fmt --quote-always
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Solution 2:[2]
Include (escaped) quotes in your field separator flag, and add them to your output print fields:
testfile.csv | awk -F "\",\"" '{print "\""$2"\",\""$3"\",\""$4}'
output:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Solution 3:[3]
If gawk or GNU awk is available, you can make use of FPAT, which matches the fields, instead of splitting on field separators.
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=, '{print $2, $3, $4}' testfile.csv
Result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
The string ([^,]+)|(\"[^\"]+\") is a regex pattern which matches either of:
([^,]+)... matches a sequence of any characters other than a comma.(\"[^\"]+\")... matches a string enclosed by double quotes (which may include commas in between).
The parentheses around the patterns are put for visual clarity purpose and the regex will work without them such as FPAT='[^,]+|\"[^\"]+\"' because the alternative | has lower precedence.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Benjamin W. |
| Solution 2 | |
| Solution 3 | tshiono |
