'Group a C# collection by inequality with a maximum group size using LINQ
Problem
I am trying to take a collection of objects and split it into multiple sub collections (using LINQ) where:
1. No sub collection contains two objects that are equal according to some arbitrary criterion (in the example, no two objects can have the same GroupId).
2. No sub collection has more than a maximum number of elements.
3. The number of sub collections should be minimized (each should be as full as possible while still satisfying 1 and 2).
What have you tried?
So far I have tried using LINQ's Aggregate function, as well as MoreLinq's Batch, Partition, DistinctBy, and some other functions. None of them appear to do quite what I want, though. I also have a working solution that does not use LINQ exclusively. I am hoping there is a fairly simple solution leveraging LINQ.
Code sample
Here is an example that does not use LINQ.
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var records = new List<Foo>() {
            new Foo(1, 1),
            new Foo(2, 1),
            new Foo(3, 1),
            new Foo(4, 2),
            new Foo(5, 2),
            new Foo(6, 3),
            new Foo(7, 4),
            new Foo(8, 5),
            new Foo(9, 6),
            new Foo(10, 7),
        };
        var groups = Group(records, 3);
        for (int i = 0; i < groups.Count(); ++i) {
            System.Console.WriteLine($"Group {i}: {Newtonsoft.Json.JsonConvert.SerializeObject(groups[i])}");
        }
    }

    public class Foo {
        public int Id;
        public int GroupId;
        public Foo(int id, int groupId) {
            this.Id = id;
            this.GroupId = groupId;
        }
    }

    public static List<List<Foo>> Group(IEnumerable<Foo> records, int maxGroupSize)
    {
        var rv = new List<List<Foo>>();
        foreach (var record in records) {
            bool added = false;
            for (int i = 0; i < rv.Count; ++i) {
                if (rv[i].Count < maxGroupSize && !rv[i].Any(a => a.GroupId == record.GroupId)) {
                    rv[i].Add(record);
                    added = true;
                    break;
                }
            }
            if (!added) {
                rv.Add(new List<Foo>() { record });
            }
        }
        return rv;
    }
}
I am not looking for just a more efficient implementation of the above. This is just an example of a solution. I am looking for a solution that leverages LINQ.
Solution 1
Yes, it's doable with pure LINQ.
Without the third requirement it's pretty trivial, but that one throws in a quirk. The code below achieves all three requirements, AFAICT; a demonstration is at https://dotnetfiddle.net/hwECKG.
Comments in the code explain how the result is achieved.
There may be opportunities for optimization, but that's beyond the scope of the question.
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var records = new List<Foo>() {
            new Foo(1, 1),
            new Foo(2, 1),
            new Foo(3, 1),
            new Foo(4, 2),
            new Foo(5, 2),
            new Foo(6, 3),
            new Foo(7, 4),
            new Foo(8, 5),
            new Foo(9, 6),
            new Foo(10, 7),
        };
        const int maxPerGroup = 3;
        var groups = records
            // Find and group all the Foo's that cannot be in the same collection.
            .GroupBy(r => r.GroupId)
            // For each Foo with the same Group id, assign an index by its position.
            // This will be used in the next step to assign same-groupid foos to different buckets.
            .SelectMany(g => g.Select((r, i) => new {
                Bucket = i,
                Record = r
            }))
            // As mentioned above, assign phobic Foos to different buckets.
            .GroupBy(g => g.Bucket, p => p.Record)
            // Below is not *technically* necessary if the `records` enumerable can be enumerated multiple times,
            // but I don't make that assumption. This converts items to an array so we can enumerate it more than once below.
            .Select(g => g.ToArray())
            // Now take each bucket and split it into members that fit and overflows, if any.
            .Select(b => new {
                Members = b.Take(maxPerGroup),
                Overflows = b.Skip(maxPerGroup)
            })
            // Via aggregation, go thru all buckets, keeping them but picking off their overflows into one single overflow collection.
            // Further processing is actually nested in the projection argument of the aggregate, since
            // aggregate wants to return a single item, and that's not what we need.
            .Aggregate(
                new { Groups = new List<IEnumerable<Foo>>(), Overflows = new Queue<Foo>() },
                (acc, next) => {
                    acc.Groups.Add(next.Members);
                    foreach (var o in next.Overflows)
                        acc.Overflows.Enqueue(o);
                    return acc;
                },
                // Here we have all buckets and a single overflow queue; now go thru each bucket and fill it up from overflow if it has any space left.
                t => t.Groups.Aggregate(
                    new { Groups = new List<IEnumerable<Foo>>(), Overflows = t.Overflows },
                    (dist, next) => {
                        var currentGroupMembers = next.ToList();
                        while (currentGroupMembers.Count() < maxPerGroup && dist.Overflows.Count > 0)
                            currentGroupMembers.Add(dist.Overflows.Dequeue());
                        dist.Groups.Add(currentGroupMembers);
                        return dist;
                    },
                    // And by the time we get here, we've filled all existing buckets, but may have things left in overflow.
                    // So just chunk up overflow and add it to our list of buckets.
                    t2 => {
                        t2.Groups.AddRange(t2.Overflows.Chunk(maxPerGroup));
                        // And that returns the final result.
                        return t2.Groups;
                    }
                )
            )
            // This is not really necessary, but helps to see it in memory when debugging.
            .ToArray();
        for (int i = 0; i < groups.Count(); ++i) {
            System.Console.WriteLine($"Group {i}: {Newtonsoft.Json.JsonConvert.SerializeObject(groups[i])}");
        }
    }

    public class Foo {
        public int Id;
        public int GroupId;
        public Foo(int id, int groupId) {
            this.Id = id;
            this.GroupId = groupId;
        }
    }
}
EDIT:
I should've added the following when posting. I tried to stay as close as possible to a reasonable interpretation of "using LINQ": do the whole thing in one whopping continuous LINQ chain, and do nearly all computation using LINQ functions (the exception is a couple of foreach loops that could be avoided, but only in a painful way).
But it can be simplified.
The scariest part is the call to .Aggregate, which is efficient but intimidating because further processing has to be nested inside its projection argument. A much simpler route is to take just the first .Aggregate without the projection argument and save its return value to a local variable; that return value is a single object containing the Groups and a single Overflows list. Then call .GetEnumerator() on the overflow, and a simple loop can go through each group and "stuff it up" by consuming the enumerator, chunking whatever remains if necessary. That code would look shorter (albeit doing about the same thing), but might not meet the OP's requirement of being "LINQ".
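Roughly, that simplified shape could look like the sketch below. The names (`Group`, `maxPerGroup`, the record type) are mine, not from the original code; the point is the single seed-only Aggregate followed by a plain loop that consumes the overflow through an enumerator instead of a queue:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public record Foo(int Id, int GroupId);

    // Sketch: one Aggregate (no projection argument), then a plain loop that
    // "stuffs up" each group by consuming the overflow enumerator.
    public static List<List<Foo>> Group(IEnumerable<Foo> records, int maxPerGroup)
    {
        var acc = records
            .GroupBy(r => r.GroupId)
            .SelectMany(g => g.Select((r, i) => (Bucket: i, Record: r)))
            .GroupBy(t => t.Bucket, t => t.Record)
            .Select(b => b.ToArray())
            .Aggregate(
                (Groups: new List<List<Foo>>(), Overflows: new List<Foo>()),
                (a, bucket) => {
                    a.Groups.Add(bucket.Take(maxPerGroup).ToList());
                    a.Overflows.AddRange(bucket.Skip(maxPerGroup));
                    return a;
                });

        // Fill each group up to maxPerGroup from the overflow enumerator.
        using var overflow = ((IEnumerable<Foo>)acc.Overflows).GetEnumerator();
        bool more = overflow.MoveNext();
        foreach (var group in acc.Groups)
        {
            while (group.Count < maxPerGroup && more)
            {
                group.Add(overflow.Current);
                more = overflow.MoveNext();
            }
        }

        // Chunk whatever overflow remains into fresh groups.
        var leftover = new List<Foo>();
        while (more) { leftover.Add(overflow.Current); more = overflow.MoveNext(); }
        acc.Groups.AddRange(leftover.Chunk(maxPerGroup).Select(c => c.ToList()));
        return acc.Groups;
    }

    public static void Main()
    {
        var records = new List<Foo> {
            new(1, 1), new(2, 1), new(3, 1), new(4, 2), new(5, 2),
            new(6, 3), new(7, 4), new(8, 5), new(9, 6), new(10, 7),
        };
        foreach (var g in Group(records, 3))
            Console.WriteLine(string.Join(", ", g.Select(f => f.Id)));
    }
}
```

This needs .NET 6+ for `Chunk`. Whether this still counts as "LINQ" for the OP's purposes is debatable, which is the point made above.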
Other simplifications are possible by creating LINQ extension methods for custom steps that vanilla LINQ makes awkward. For challenge purposes I didn't go that way, but from my own experience, that can simplify code considerably (and still be "LINQ").
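As a sketch of that idea, the bucket-assignment step from the answer could be hidden behind one named extension method. The `SpreadBy` name and shape here are my invention, not standard LINQ:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupingExtensions
{
    // Hypothetical helper, not part of vanilla LINQ: spreads items that share
    // a key across buckets so that no bucket contains the same key twice.
    // This is exactly the GroupBy/SelectMany/GroupBy dance from the answer,
    // wrapped in one named operation.
    public static IEnumerable<IEnumerable<T>> SpreadBy<T, TKey>(
        this IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        return source
            .GroupBy(keySelector)
            .SelectMany(g => g.Select((item, i) => (Bucket: i, Item: item)))
            .GroupBy(t => t.Bucket, t => t.Item);
    }
}
```

With that in place, the chain would start with `records.SpreadBy(r => r.GroupId)` instead of the three raw operators, which reads much closer to the intent.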
Also, I should note that my demo is not meant to show the "most efficient use of LINQ", but rather to take on the challenge of doing this in LINQ (which I knew was possible). This code can probably be simplified, and I welcome anyone to take a shot at it; I just don't have time to also make the solution more "elegant."
But to the comments that it's too complicated: it depends. We don't know why the OP wants it in LINQ. LINQ tends to be efficient because it avoids extra copying and moving of data. Or maybe the OP wants to plug it into a larger LINQ-based infrastructure, and thus wants an IEnumerable out for further processing. I once worked on a reporting-like problem where a LINQ chain was generated programmatically based on parameters and resulted in 25+ transformations before reaching the final answer, but it worked efficiently, minimizing data copy operations and plugging into a generic consumer of the data.
UPDATE 2
So here's a shorter version; it's a bit less LINQy, though it still very much uses LINQ to solve the problem: https://dotnetfiddle.net/n9Fz6K
I'm not sure whether it looks less scary, or which version performs better. But there's definitely more than one way to do this.
LINQ takes time to get used to, and reading the code can be a bit like reading SQL: you have to think in sets of data rather than individual records. But once you get the hang of it, it's much easier to read and understand than the old-school way :D.
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var records = new List<Foo>() {
            new Foo(1, 1),
            new Foo(2, 1),
            new Foo(3, 1),
            new Foo(4, 2),
            new Foo(5, 2),
            new Foo(6, 3),
            new Foo(7, 4),
            new Foo(8, 5),
            new Foo(9, 6),
            new Foo(10, 7),
        };
        const int maxPerGroup = 3;
        var groupAssigns = records
            // Find and group all the Foo's that cannot be in the same collection.
            .GroupBy(r => r.GroupId)
            // For each Foo with the same Group id, assign an index by its position.
            // This will be used in the next step to assign same-groupid foos to different buckets.
            .SelectMany(g => g.Select((r, i) => new {
                Bucket = i,
                Record = r
            }))
            // As mentioned above, assign phobic Foos to different buckets.
            .GroupBy(g => g.Bucket, p => p.Record)
            // Now take each bucket and split it into members that fit and overflows, if any.
            .Select(b => b.Chunk(maxPerGroup).ToArray())
            .Select(b => new {
                Group = b[0],
                Overflows = b.Skip(1).SelectMany(c => c)
            })
            .Aggregate(
                new { Groups = new List<List<Foo>>(), Overflows = new Queue<Foo>() },
                (acc, next) => {
                    acc.Groups.Add(next.Group.ToList());
                    foreach (var o in next.Overflows)
                        acc.Overflows.Enqueue(o);
                    return acc;
                }
            );
        var groups = groupAssigns.Groups;
        foreach (var g in groups)
        {
            while (g.Count() < maxPerGroup && groupAssigns.Overflows.Count > 0)
                g.Add(groupAssigns.Overflows.Dequeue());
        }
        groups.AddRange(groupAssigns.Overflows.Chunk(maxPerGroup).Select(c => c.ToList()));
        for (int i = 0; i < groups.Count(); ++i) {
            System.Console.WriteLine($"Group {i}: {Newtonsoft.Json.JsonConvert.SerializeObject(groups[i])}");
        }
    }

    public class Foo {
        public int Id;
        public int GroupId;
        public Foo(int id, int groupId) {
            this.Id = id;
            this.GroupId = groupId;
        }
    }
}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
