'Remove duplicates rows from large csv with multiline fields
Given the following csv, with multiline fields:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"1","line 1
line 2"
... which displays as:
| id | text |
|---|---|
| 1 | line 1 line 2 |
| 2 | line 1 line 2 |
| 1 | line 1 line 2 |
If I use the following awk command to remove duplicate rows from this csv based on the id (column 1):
awk -F, '!x[$1]++' 'file-01.csv' > 'file-01-deduped.csv'
I end up with:
"id","text"
"1","line 1
line 2"
"2","line 1
which displays as:
| id | text |
|---|---|
| 1 | line 1 line 2 |
| 2 | line 1 |
This is an oversimplified example, but it seems awk doesn't play well with multiline fields. Perhaps I'm missing something though.
Additional info: I'm writing these csv's according to RFC4180 standards—most notably, fields containing line breaks, double quotes, and commas are enclosed in double-quotes. And double quotes appearing inside a field are escaped with a preceding double quote.
Also, I'm writing the csv in Node/JS, but I found awk to be a really simple/fast way to dedupe very large files in the past—none had multiline fields though.
I'm by no means bound to awk—I'm open to any/all suggestions—just wanted to be clear about what I've tried. Thanks!
Solution 1:[1]
As others have pointed out, you need a CSV-aware tool to properly handle the line breaks inside the rows.
GoCSV was made for this: it's fast, pretty good w/memory, is CSV savvy, and is pre-built for a number of platforms.
Its unique subcommand will keep only the first row based on the occurrence of a value or set of values in a column or set of columns.
To drop duplicate rows based on the text column:
gocsv unique -c 'text' input.csv > de-duped.csv
It can even tell you how many dupes it found along the way:
gocsv unique -c 'text' -count input.csv > de-duped.csv
How fast, how good w/memory?.
I mocked up a 1_000_000 row CSV with two columns of random text and embedded line breaks (also includes commas and quoted quotes):
ll -h gen_1000000x3.csv
-rw-r--r-- 1 zyoung staff 52M Apr 26 09:36 gen_1000000x3.csv
cat gen_1000000x3.csv
ID,Col1,Col2
0,"ddddd
"","" oooooo","wwwwww
"","" nnnnnnn"
1,"llllllll
"","" ccccccc","iiiiiiii
"","" wwwww"
2,"nnnnn
"","" iiiiiiii","ooooo
"","" kkkkkkkk"
...
On my M1 MacBook Air, de-duping the 1-million row, 52 MB CSV took a half-second and consumed only 13 MB of memory:
/usr/bin/time -l gocsv unique -c Col2 gen_1000000x3.csv > de-duped.csv
0.45 real 0.49 user 0.05 sys
...
13124608 peak memory footprint
Over 989_000 duplicate rows were dropped:
gocsv dims de-duped.csv
Dimensions:
Rows: 10816
Columns: 3
We can count instances of each of value in Col2 that was found (counting consumed 175 MB of memory):
gocsv unique -c Col2 -count gen_1000000x3.csv > de-duped.csv
GoCSV can also display multi-line rows in the terminal:
+--------+---------------+---------------+-------+
| ID | Col1 | Col2 | Count |
+--------+---------------+---------------+-------+
| 0 | ddddd | wwwwww | 80 |
| | "," oooooo | "," nnnnnnn | |
+--------+---------------+---------------+-------+
| 1 | llllllll | iiiiiiii | 89 |
| | "," ccccccc | "," wwwww | |
+--------+---------------+---------------+-------+
| 2 | nnnnn | ooooo | 97 |
| | "," iiiiiiii | "," kkkkkkkk | |
...
I cannot compare the awk scripts suggested so far: one just doesn't do anything in my terminal, and the other requires GNU which I don't have. But awk will come out slower: 3x longer just to run awk '{print $0}' gen_1000000x3.csv > /dev/null, and that's not even doing meaningful work. And the hoops you have to jump through to try and program a CSV parser from scratch.
Solution 2:[2]
With your shown samples only, please try following awk code. Written and tested in GNU awk, should work in any awk.
awk -F',' '
FNR>1{
sub(/^"/,"",$2)
sub(/"$/,"",$3)
gsub(/"/,"",$1)
print $1 OFS $2 ORS " " $3
}
' <(awk '{printf("%s%s",$0!~/^"/?",":FNR>1?ORS:"",$0)} END{print ""}' Input_file)
Explanation: Simple explanation would be, running 1st awk to print all lines in single row(wherever it has line not started from ") and sending its output as an input to main awk where printing the required values of id and all line values as per requirement.
Solution 3:[3]
Awk is not csv aware, so it's not really the right tool for the job. There are a few csv implementations floating around the internets, maybe you could take a look at them.
You did mention the file being large, but if it fits your memory, this is a variation of something I needed a few weeks back. It's GNU awk using FPAT so it's not really fast:
$ gawk '
BEGIN {
RS="^$" # read in whole file
FPAT="([^,\n]*)|(\"(\"\"|[^\"])+\")" # regex magic
OFS=","
}
{
for(i=1;i<NF;i+=2) # iterate fields 2 at a time
if(!a[$i]++) # if first field not seen before
print $i,$(i+1) # output 2 fields
}' file
Test data:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
"1","line 1
line 2"
Output:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
I don't know how many ways it can fail you, tho.
Solution 4:[4]
A great and very simple tool CSV aware is Miller. Running
mlr --csv uniq -a input.csv >output.csv
You will have
id,text
1,"line 1
line 2"
2,"line 1
line 2"
It has also a great documentation: this is the one for the uniq verb.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | James Brown |
| Solution 4 | aborruso |
