How to group columns and sum them, in a large CSV?

I have a large CSV (hundreds of millions of rows) and I need to sum the Value column based on the grouping of the ID, Location, and Date columns.

My CSV is similar to:

    ID Location        Date  Value
 1   1     Loc1  2022-01-27      5
 2   1     Loc1  2022-01-27      4
 3   1     Loc1  2022-01-28      7
 4   1     Loc2  2022-01-29      8
 5   2     Loc1  2022-01-27     11
 6   2     Loc2  2022-01-28      4
 7   2     Loc2  2022-01-29      6
 8   3     Loc1  2022-01-28      9
 9   3     Loc1  2022-01-28      9
10   3     Loc2  2022-01-29      1
  • {ID: 1, Location: Loc1, Date: 2022-01-27} is one such group, and its sub values 5 and 4 should be summed to 9
  • {ID: 3, Location: Loc1, Date: 2022-01-28} is another group and its sum should be 18

Here's what that sample input should look like, processed/summed, and written to a new CSV:

ID Location        Date  Value
1     Loc1  2022-01-27      9
1     Loc1  2022-01-28      7
1     Loc2  2022-01-29      8
2     Loc1  2022-01-27     11
2     Loc2  2022-01-28      4
2     Loc2  2022-01-29      6
3     Loc1  2022-01-28     18
3     Loc2  2022-01-29      1

I know using df.groupby([columns]).sum() would give the desired result, but the CSV is so big I keep getting memory errors. I've tried looking at other ways to read/manipulate CSV data but have still not been successful, so if anyone knows a way I can do this in python without maxing out my memory that would be great!

NB: I know there is an unnamed first column in my initial CSV; it's irrelevant and doesn't need to be in the output, but it doesn't matter if it is :)



Solution 1:[1]

If the rows to be combined are consecutive, the good old csv module lets you process huge files one line at a time, hence with a minimal memory footprint.

Here you could use:

import csv

with open('input.csv', newline='') as fd, open('output.csv', 'w', newline='') as fdout:
    rd, wr = csv.reader(fd), csv.writer(fdout)
    wr.writerow(next(rd))          # copy the header line
    old = [None] * 4
    for row in rd:
        row[3] = int(row[3])       # convert the Value field to an integer
        if row[:3] == old[:3]:
            old[3] += row[3]       # same group: accumulate the value
        else:
            if old[0] is not None: # group finished: write its sum
                wr.writerow(old)
            old = row
    if old[0] is not None:         # do not forget the last group...
        wr.writerow(old)

With the shown input data, it gives as expected:

ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1

Not as clean and neat as Pandas code, but it should process files larger than the available memory without any problem.
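The same streaming idea can also be written with the standard library's itertools.groupby, which yields runs of consecutive rows sharing a key. A sketch under the same sorted-input assumption (sum_consecutive_groups is an illustrative name, not from the original answer):

```python
import csv
from itertools import groupby
from operator import itemgetter

def sum_consecutive_groups(in_path, out_path):
    # Stream the file; groupby yields runs of consecutive rows
    # sharing the same (ID, Location, Date) key.
    with open(in_path, newline='') as fd, open(out_path, 'w', newline='') as fdout:
        rd, wr = csv.reader(fd), csv.writer(fdout)
        wr.writerow(next(rd))  # copy the header
        for key, rows in groupby(rd, key=itemgetter(0, 1, 2)):
            wr.writerow([*key, sum(int(r[3]) for r in rows)])
```

Because groupby only ever holds one run of rows at a time, the memory profile is the same as the hand-rolled loop above.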

Solution 2:[2]

You could use the built-in csv library and build up the output line by line. A Counter can be used to combine rows with the same first three fields, summing their values:

from collections import Counter
import csv

data = Counter()

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    
    for row in csv_input:
        data[tuple(row[:3])] += int(row[3])

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)

    for key, value in data.items():
        csv_output.writerow([*key, value])

Giving the output:

ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1

This avoids storing the input CSV in memory; only the aggregated output data is held.


If this is also too large, a slight variation would be to output data whenever the ID column changes. This does, though, assume the input is sorted by ID:

from collections import Counter
import csv

def write_id(csv_output, data):
    for key, value in data.items():
        csv_output.writerow([*key, value])
    data.clear()


data = Counter()
current_id = None

with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    
    header = next(csv_input)
    csv_output.writerow(header)
    
    for row in csv_input:
        if current_id and row[0] != current_id:
            write_id(csv_output, data)
            
        data[tuple(row[:3])] += int(row[3])
        current_id = row[0]
        
    write_id(csv_output, data)        

For the given example, this would give the same output.
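That claim is easy to check. A self-contained sketch (using the sample data and illustrative names) that runs the ID-flushing logic in memory and compares its output against a single full Counter:

```python
from collections import Counter

rows = [
    ['1', 'Loc1', '2022-01-27', '5'],
    ['1', 'Loc1', '2022-01-27', '4'],
    ['1', 'Loc1', '2022-01-28', '7'],
    ['1', 'Loc2', '2022-01-29', '8'],
    ['2', 'Loc1', '2022-01-27', '11'],
    ['2', 'Loc2', '2022-01-28', '4'],
    ['2', 'Loc2', '2022-01-29', '6'],
    ['3', 'Loc1', '2022-01-28', '9'],
    ['3', 'Loc1', '2022-01-28', '9'],
    ['3', 'Loc2', '2022-01-29', '1'],
]

def flushing_sums(rows):
    # Same logic as the variant above, but collecting output rows
    # in a list instead of writing them to a CSV.
    out, data, current_id = [], Counter(), None
    for row in rows:
        if current_id is not None and row[0] != current_id:
            out.extend([*k, v] for k, v in data.items())
            data.clear()
        data[tuple(row[:3])] += int(row[3])
        current_id = row[0]
    out.extend([*k, v] for k, v in data.items())
    return out

# Reference: sum everything in one Counter.
full = Counter()
for row in rows:
    full[tuple(row[:3])] += int(row[3])

assert flushing_sums(rows) == [[*k, v] for k, v in full.items()]
```

The outputs match because the input is sorted by ID, so each ID's groups are complete by the time they are flushed.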

Solution 3:[3]

Have you tried:

output = []
for key, group in df.groupby(['ID', 'Location', 'Date']):
    output.append((*key, group['Value'].sum()))

pd.DataFrame(output, columns=['ID', 'Location', 'Date', 'Value']).to_csv("....csv")

source: https://stackoverflow.com/a/54244289/7132906
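Note that iterating over df.groupby still requires the whole CSV to be loaded first, which was the original problem. pandas' read_csv accepts a chunksize parameter that yields the file as a sequence of smaller DataFrames, allowing a two-pass aggregation within bounded memory. A sketch, assuming the header contains the columns ID, Location, Date, and Value (chunked_group_sum is an illustrative name):

```python
import pandas as pd

def chunked_group_sum(in_path, out_path, chunksize=1_000_000):
    # First pass: sum within each chunk (at most `chunksize` rows in memory).
    partials = []
    for chunk in pd.read_csv(in_path, usecols=['ID', 'Location', 'Date', 'Value'],
                             chunksize=chunksize):
        partials.append(
            chunk.groupby(['ID', 'Location', 'Date'], as_index=False)['Value'].sum())
    # Second pass: merge the per-chunk partial sums; memory now scales with
    # the number of distinct groups, not the number of rows.
    (pd.concat(partials)
       .groupby(['ID', 'Location', 'Date'], as_index=False)['Value'].sum()
       .to_csv(out_path, index=False))
```

This works on unsorted input, but only helps if the number of distinct (ID, Location, Date) groups is itself small enough to fit in memory.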

Solution 4:[4]

There are a number of answers already that may suffice: @MartinEvans and @Corralien both recommend breaking-up/chunking the input-output. I'm especially curious if @MartinEvans's answer works within your memory constraints: it's the simplest and still-correct solution so far (as I see it).

If either of those don't work, I think you'll be faced with the question:

What defines a chunk such that every ID/Loc/Date group I need to count is fully contained in that chunk, so that no group crosses a chunk boundary and gets counted more than once (ending up as several smaller partial sums instead of a single true sum)?

In a comment on the OP you said the input was sorted by "week number". I think this is the single deciding factor for when you have all the counts you'll get for a group of ID/Loc/Date. As the reader crosses a week-group boundary, it knows it's "safe" to stop counting any of the groups encountered so far, and can flush those counts to disk (to avoid holding too many counts in memory).

This solution relies on the pre-sorted-ness of your input CSV. Though, if your input was a bit out of sorts: you could run this, test for duplicate groups, re-sort, and re-run this (I see this problem as making a big, memory-constrained reducer):

import csv
from collections import Counter
from datetime import datetime


# Get the data out...
out_csv = open('output.csv', 'w', newline='')
writer = csv.writer(out_csv)

def write_row(row):
    global writer
    writer.writerow(row)


# Don't let counter get too big (for memory)
def flush_counter(counter):
    for key, sum_ in counter.items():
        id_, loc, date = key
        write_row([id_, loc, date, sum_])


# You said "already grouped by week-number", so:
# -   read and sum your input CSV in chunks of "week (number) groups"
# -   once the reader reads past a week-group, it concludes week-group is finished
#     and flushes the counts for that week-group

last_wk_group = None
counter = Counter()

# Open input
with open('input.csv', newline='') as f:
    reader = csv.reader(f)

    # Copy header
    header = next(reader)
    write_row(header)

    for row in reader:
        # Get "base" values
        id_, loc, date = row[0:3]
        value = int(row[3])

        # 2022-01-27  ->  2022-04
        wk_group = datetime.strptime(date, r'%Y-%m-%d').strftime(r'%Y-%U')

        # Decide if last week-group has passed
        if wk_group != last_wk_group:
            flush_counter(counter)
            counter = Counter()
            last_wk_group = wk_group

        # Count/sum this week-group
        key = tuple([id_, loc, date])
        counter[key] += value


# Flush remaining week-group counts
flush_counter(counter)

As a basic test, I moved the first row of your sample input to the last row, like @Corralien was asking:

ID,Location,Date,Value
1,Loc1,2022-01-27,5
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,9
3,Loc1,2022-01-28,9
3,Loc2,2022-01-29,1
1,Loc1,2022-01-27,4

and I still get the correct output (even in the correct order, because 1,Loc1,2022-01-27 appeared first in the input):

ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Serge Ballesta
Solution 2
Solution 3 Emi OB
Solution 4