Remove duplicate rows from a large CSV with multiline fields

Given the following CSV, with multiline fields:

"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"1","line 1
line 2"

... which displays as:

id text
1 line 1
line 2
2 line 1
line 2
1 line 1
line 2

If I use the following awk command to remove duplicate rows from this CSV based on the id (column 1):

awk -F, '!x[$1]++' 'file-01.csv' > 'file-01-deduped.csv'

I end up with:

"id","text"
"1","line 1
line 2"
"2","line 1

which displays as:

id text
1 line 1
line 2
2 line 1

This is an oversimplified example, but it seems awk doesn't play well with multiline fields. Perhaps I'm missing something though.

Additional info: I'm writing these CSVs according to the RFC 4180 standard. Most notably, fields containing line breaks, double quotes, or commas are enclosed in double quotes, and a double quote appearing inside a field is escaped with a preceding double quote.
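For example, a single record following those rules, where the text field contains a comma, a line break, and an embedded double quote, would look like this (an illustrative record, not one from my real data):

"9","She said ""hello, world""
and left"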

Also, I'm writing the CSVs in Node/JS, but I've found awk to be a really simple and fast way to dedupe very large files in the past; none of those had multiline fields, though.

I'm by no means bound to awk—I'm open to any/all suggestions—just wanted to be clear about what I've tried. Thanks!



Solution 1:[1]

As others have pointed out, you need a CSV-aware tool to properly handle the line breaks inside the rows.

GoCSV was made for this: it's fast, pretty good w/memory, is CSV savvy, and is pre-built for a number of platforms.

Its unique subcommand keeps only the first row for each occurrence of a value (or set of values) in a column (or set of columns).

To drop duplicate rows based on the text column:

gocsv unique -c 'text' input.csv > de-duped.csv
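For the question's data, keying the de-dupe on the id column should just be a matter of naming that column instead (id is the header from the question's sample):

gocsv unique -c 'id' input.csv > de-duped.csv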

It can even tell you how many dupes it found along the way:

gocsv unique -c 'text' -count input.csv > de-duped.csv

How fast, and how good with memory?

I mocked up a 1_000_000 row CSV with two columns of random text and embedded line breaks (also includes commas and quoted quotes):

ll -h gen_1000000x3.csv
-rw-r--r--  1 zyoung  staff    52M Apr 26 09:36 gen_1000000x3.csv

cat gen_1000000x3.csv
ID,Col1,Col2
0,"ddddd
 "","" oooooo","wwwwww
 "","" nnnnnnn"
1,"llllllll
 "","" ccccccc","iiiiiiii
 "","" wwwww"
2,"nnnnn
 "","" iiiiiiii","ooooo
 "","" kkkkkkkk"
...

On my M1 MacBook Air, de-duping the 1-million row, 52 MB CSV took a half-second and consumed only 13 MB of memory:

/usr/bin/time -l gocsv unique -c Col2 gen_1000000x3.csv  > de-duped.csv       
        0.45 real         0.49 user         0.05 sys
            ...
            13124608  peak memory footprint

Over 989_000 duplicate rows were dropped:

gocsv dims de-duped.csv 
Dimensions:
  Rows: 10816
  Columns: 3

We can count the instances of each value in Col2 that was found (counting consumed 175 MB of memory):

gocsv unique -c Col2 -count gen_1000000x3.csv  > de-duped.csv

GoCSV can also display multi-line rows in the terminal:

+--------+---------------+---------------+-------+
| ID     | Col1          | Col2          | Count |
+--------+---------------+---------------+-------+
| 0      | ddddd         | wwwwww        | 80    |
|        |  "," oooooo   |  "," nnnnnnn  |       |
+--------+---------------+---------------+-------+
| 1      | llllllll      | iiiiiiii      | 89    |
|        |  "," ccccccc  |  "," wwwww    |       |
+--------+---------------+---------------+-------+
| 2      | nnnnn         | ooooo         | 97    |
|        |  "," iiiiiiii |  "," kkkkkkkk |       |
...

I cannot compare the awk scripts suggested so far: one just doesn't do anything in my terminal, and the other requires GNU awk, which I don't have. But awk will come out slower: it takes 3x longer just to run awk '{print $0}' gen_1000000x3.csv > /dev/null, and that's not even doing meaningful work. And then there are the hoops you have to jump through to program a CSV parser from scratch.

Solution 2:[2]

Based only on your shown samples, please try the following awk code. It was written and tested in GNU awk, and should work in any awk.

awk -F',' '
FNR>1{
  sub(/^"/,"",$2)               # drop the leading quote of the text field
  sub(/"$/,"",$3)               # drop the trailing quote of its second line
  gsub(/"/,"",$1)               # drop the quotes around the id
  print $1 OFS $2 ORS "  " $3   # id + first line, then the indented second line
}
' <(awk '{printf("%s%s",$0!~/^"/?",":FNR>1?ORS:"",$0)} END{print ""}' Input_file)

Explanation: in simple terms, the first (inner) awk joins each record back onto a single line, appending any line that does not start with " to the previous line, and its output is then fed to the main awk, which prints the id and the text lines in the required layout.
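To make that concrete, here is the intermediate form the inner awk produces for the question's sample (each record rejoined onto one line, with the comma marking where the continuation line was appended):

"id","text"
"1","line 1,line 2"
"2","line 1,line 2"
"1","line 1,line 2"

The main awk then splits these single-line records on commas and reformats them.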

Solution 3:[3]

Awk is not CSV aware: it treats each physical line as a record, so the continuation lines of your multiline fields get deduped as if they were rows of their own. It's not really the right tool for the job. There are a few CSV-aware implementations floating around the internet; maybe you could take a look at them.

You did mention the file being large, but if it fits in memory, here is a variation of something I needed a few weeks back. It's GNU awk using FPAT, so it's not particularly fast:

$ gawk '
BEGIN {
    RS="^$"                                # read in whole file
    FPAT="([^,\n]*)|(\"(\"\"|[^\"])+\")"   # regex magic
    OFS=","
}
{
    for(i=1;i<NF;i+=2)                     # iterate fields 2 at a time
        if(!a[$i]++)                       # if first field not seen before
            print $i,$(i+1)                # output 2 fields
}' file
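To unpack the "regex magic" a little: FPAT describes what a field looks like rather than what separates fields. Reading the pattern with the string escapes removed:

([^,\n]*)          an unquoted field: any run of characters with no comma and no newline
|                  or
("(""|[^"])+")     a quoted field in double quotes, where a literal quote is doubled ("");
                   since [^"] also matches newlines, quoted fields are allowed to span lines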

Test data:

"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
"1","line 1
line 2"

Output:

"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"

I don't know how many ways it can fail you, though.

Solution 4:[4]

A great and very simple CSV-aware tool is Miller. Running

mlr --csv uniq -a input.csv >output.csv

you will get

id,text
1,"line 1
line 2"
2,"line 1
line 2"

It also has great documentation; this is the page for the uniq verb.
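Note that uniq -a drops a row only when every field matches. If you want to keep just the first row per id, as in the question, Miller's head verb with a group-by should do it (worth double-checking against the docs for your Miller version):

mlr --csv head -n 1 -g id input.csv > output.csv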

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 James Brown
Solution 4 aborruso