'Group and Sum Multiple Columns without Pandas

I have a list that contains multiple columns, and I need to group and sum rows based on two columns. Can I do this without using a Pandas dataframe?

I have a dataset in a list like this:

User   Days  Project
Dave   3     Red
Dave   4     Red
Dave   2     Blue
Sue    4     Red
Sue    1     Red
Sue    3     Yellow

Specifically: [[Dave, 3, Red], [Dave, 4, Red], [Dave, 2, Blue], [Sue, 4, Red], [Sue, 1, Red], [Sue, 3, Yellow]]

What I want to do is output on the same line some totals like this:

User   Days  Project   UserDays  ProjectDaysPerUser
Dave   3     Red       9              7
Dave   4     Red       9              7
Dave   2     Blue      9              2
Sue    4     Red       8              5
Sue    1     Red       8              5
Sue    3     Yellow    8              3

So I'm trying to group twice to get the "ProjectDaysPerUser", first by user, then by project. It's this double grouping that's throwing me off.

Is there an easy way to do this without creating a Panda dataframe?



Solution 1:[1]

use dictionary for improved performance

data = [['Dave', 3, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Dave', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
sum_dict = {}
for d in data:
    sum_dict[d[0]] = sum_dict.get(d[0], 0) + d[1]
    sum_dict[(d[0], d[2])] = sum_dict.get((d[0], d[2]), 0) + d[1]

for d in data:
    d.append(sum_dict[d[0]])
    d.append(sum_dict[(d[0], d[2])])
    print(d)

Solution 2:[2]

Because you're doing sums, this can also be solved nicely with collections.Counter:

from collections import Counter

data = [['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]


user_days = Counter()
project_user_days = Counter()

for (name, num_days, project) in data:
    user_days[name] += num_days
    project_user_days[(name, project)] += num_days

derived_data = [
    [name, num_days, project, user_days[name], project_user_days[(name, project)]]
    for (name, num_days, project) in data
]

import pprint
pprint.pprint(derived_data)

# [['Dave', 3, 'Red', 9, 7],
#  ['Dave', 4, 'Red', 9, 7],
#  ['Dave', 2, 'Blue', 9, 2],
#  ['Sue', 4, 'Red', 8, 5],
#  ['Sue', 1, 'Red', 8, 5],
#  ['Sue', 3, 'Yellow', 8, 3]]

Solution 3:[3]

Efficient Code

import itertools

def group_data(input1:list)->list:
    name_dict = {k : sum(v[1] for v in g) for k, g in itertools.groupby(sorted(input1, key=lambda x:x[0]), key=lambda x:x[0])}
    name_colour_dict = {k: sum(v[1] for v in g) for k,g in itertools.groupby(sorted(input1, key=lambda x:(x[0], x[2])), key=lambda x:(x[0],x[2]))}

    for row in input1:
        name = row[0]
        name_colour = (row[0], row[2])
        row.append(name_dict[name])
        row.append(name_colour_dict[name_colour])

    print(input1)

group_data([['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]

)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 H R
Solution 2 Ben
Solution 3 Shwetha Jog