'Why isn't Parallel.For fast with heap-intensive operations?
For some operations Parallel scales well with the number of CPU's, but for other operations it does not.
Consider the code below, function1 gets a 10x improvement while function2 gets a 3x improvement. Is this due to memory allocation, or perhaps GC?
void function1(int v) {
for (int i = 0; i < 100000000; i++) {
var q = Math.Sqrt(v);
}
}
void function2(int v) {
Dictionary<int, int> dict = new Dictionary<int, int>();
for (int i = 0; i < 10000000; i++) {
dict.Add(i, v);
}
}
var sw = new System.Diagnostics.Stopwatch();
var iterations = 100;
sw.Restart();
for (int v = 0; v < iterations; v++) function1(v);
sw.Stop();
Console.WriteLine("function1 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
Parallel.For(0, iterations, function1);
sw.Stop();
Console.WriteLine("function1 with parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
for (int v = 0; v < iterations; v++) function2(v);
sw.Stop();
Console.WriteLine("function2 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
Parallel.For(0, iterations, function2);
sw.Stop();
Console.WriteLine("function2 parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
The output on my machine:
function1 no parallel: 2 059,4 ms
function1 with parallel: 213,7 ms
function2 no parallel: 14 192,8 ms
function2 parallel: 4 491,1 ms
Environment:
Win 11, .Net 6.0, Release build
i9 12th gen, 16 cores, 24 proc, 32 GB DDR5
After testing more it seems the memory allocation does not scale that well with multiple threads. For example, if I change function 2 to:
void function2(int v) {
Dictionary<int, int> dict = new Dictionary<int, int>(10000000);
}
The result is:
function2 no parallell: 124,0 ms
function2 parallell: 402,4 ms
Is the conclusion that memory allocation does not scale well with multiple threads?...
Solution 1:[1]
tl;dr: Heap allocation contention.
Your first function is embarrassingly parallel. Each thread can do its computation with embarrassingly little interaction with other threads. So it scales up nicely to multiple threads. huseyin tugrul buyukisik correctly pointed out that your first computation makes use of the non-shared, per thread, processor registers.
Your second function, when it preallocates the dictionary, is somewhat less embarrassingly parallel. Each thread's computation is independent of the others' except for the fact that they each use your machine's RAM subsystem. So you see some thread-to-thread contention at the hardware level as thread-level cached data is written to and read from the machine-level RAM.
Your second function that does not preallocate memory is not embarrassingly parallel. Why not? Each .Add() operation must allocate some data in the shared heap. That can't be done in parallel, because all threads share the same heap. Rather they must be synchronized. The dotnet libraries do a good job of parallelizing heap operations as much as possible, but they do not avoid at least some blocking of thread B when thread A allocates heap data. So the threads slow each other down.
Separate processes rather than separate threads are a good way to scale up workloads like your non-preallocating second function. Each process has its own heap.
Solution 2:[2]
First func works in registers. More cores = more registers.
Second func works on memory. More cores = only more L1 cache but shared RAM. 10million elements dataset certainly only come from RAM as even L3 is not big enough. This assumes jit of language optimizes allocations as reused buffers. If not, then there is allocation overhead too. So you should re-use dictionary on each new iteration instead of recreating.
Also you are saving data with incremental integer index. Simple array could work here, of course with re-use between iterations. It should have less memory footprint than a dictionary.
Solution 3:[3]
Parallel programming is not that simple. Using Parallel.For() or Parallel.ForEach() doesn't automatic make your program parallel. Parallel programming is not about calling any higher level function (in any programming language) to make your code parallel. Is about prepare your code to be parallel.
Actually, you are not paralleling anything at all neither func1 or func2. Backing to the foundation, the two basic types of parallelism are:
By task, which you split a complex task in smaller subtasks, each subtask to be processed at same time for different cores, CPUs or nodes (in a computer cluster)
By data, which you split a large data set into several smaller slices, each slice to be processed at same time for different cores, CPUs or nodes
Data parallelism is way more trickier to achieve and and not always provide a real performance gain.
Func1 is not really parallel, it's just a heavy piece of computation running concurrently. (Your CPU are just disputing who will finish the 100M for loop first) Using Parallel.For() you are just spawning this heavy function 100 times among your threads. A single for loop with Task.Run() inside would have nearly the same result
If your run this in only one thread/core obviously will take sometime. If you run in all your cores will be faster. No big mistery here, although being a concurrent code, not actually parallel. Besides, invoking these tasks 100 times, if you don't have these amount of CPU cores (or nodes in cluster) there's no big difference, parallel/concurrent code will be limit by the actual CPU cores in the machine (will see in a future example)
Now about the Func2 and the interaction with memory heap. Yes, every modern language with a built-in GC it's CPU expensive. One of the most expensive operation in an complex algorithm it's Garbage Collection, sometimes ad in non-optimized codes it can represents over 90% of CPU time.
Let's analyze your function2
- Declare a new Dictionary into the function scope
- Populate this Dictionary with 100M items
- Outer the scope, you called function2 inside a Parallel.For with 100 interations
- 100 different scopes populate 100 different Dictionary with 100M data
- There's no interaction between any of these scopes
As said before, this is not parallel programming, this is concurrent programming. You have separete 100 data chunks of 100M entries in each scope that doesn't intereact each other
But also there's a second factor too. Your function2 operation is a write operation (it means your adding-updading-deleting something to a collection). Well if it's just a bunch of random data and you can admit some loss and inconsistency okay. But if your're handling real data and cannot allow any kind of loss or inconsistency, bad news. There's no true parallel for writing a same memory address (object reference). You will need a synchronization contex and this will make things way slower, and these syncronized operations will always be concurrent, because if a thread is writing on memory reference, the other thread must wait until the other thread leaves. Actually, using several threads to write data might make your code slower instead faster, specially if the parallel operations are not CPU-bound.
For having real gains with data parallelism, you must have been using heavy computations uppon these partitioned data.
Let's check come code below, based on your methodology but with some changes:
var rand = new Random();
var operationSamples = 256;
var datasetSize = 100_000_000;
var computationDelay = 50;
var cpuCores = Environment.ProcessorCount;
Dictionary<int, int> datasetWithLoss = new(datasetSize);
Dictionary<int, int> dataset = new(datasetSize);
double result = 0;
Stopwatch sw = new();
ThreadPool.SetMinThreads(1, 1);
int HeavyComputation(int delay)
{
int iterations = 0;
var end = DateTime.Now + TimeSpan.FromMilliseconds(delay);
while (DateTime.Now < end)
iterations++;
return iterations;
}
double SequentialMeanHeavyComputation(int maxMilliseconds, int samples = 64)
{
double sum = 0;
for (int i = 0; i < samples; i++)
sum += HeavyComputation(maxMilliseconds);
return sum / samples;
}
double ParallelMeanHeavyComputation(int maxSecondsCount, int samples = 64, int threads = 4)
{
ThreadPool.SetMaxThreads(threads, threads);
ThreadPool.GetAvailableThreads(out int workerThreads, out _);
Console.WriteLine($"Available Threads: {workerThreads}");
var _lockKey = new object();
double sum = 0;
int offset = samples / threads;
List<Action> tasks = new();
for (int i = 0; i < samples; i++)
tasks.Add(new Action(() =>
{
var result = HeavyComputation(maxSecondsCount);
lock (_lockKey)
sum += result;
}));
Parallel.Invoke(new ParallelOptions { MaxDegreeOfParallelism = threads }, tasks.ToArray());
return sum / samples;
}
void SequentialDatasetPopulation(int size)
{
for (int i = 0; i < datasetSize; i++)
dataset.TryAdd(i, Guid.NewGuid().GetHashCode());
}
void ParalellDatasetPopulation(int size, int threads)
{
var _lock = new object();
ThreadPool.SetMaxThreads(threads, threads);
ThreadPool.GetAvailableThreads(out int workerThreads, out _);
Console.WriteLine($"Available Threads: {workerThreads}");
Parallel.For(0, datasetSize, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
{
var value = Guid.NewGuid().GetHashCode();
lock (_lock)
dataset.Add(i, value);
});
}
double SequentialReadOnlyDataset()
{
foreach (var x in dataset)
{
HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
}
return 0;
}
double ParallelReadOnlyDataset()
{
Parallel.ForEach(dataset, x =>
{
HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
});
return 0;
}
void ParalellDatasetWithLoss(int size, int threads)
{
ThreadPool.SetMaxThreads(threads, threads);
ThreadPool.GetAvailableThreads(out int workerThreads, out _);
Console.WriteLine($"Available Threads: {workerThreads}");
Parallel.For(0, datasetSize, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
{
int value = Guid.NewGuid().GetHashCode();
datasetWithLoss.Add(i, value);
});
}
sw.Restart();
result = SequentialMeanHeavyComputation(computationDelay, operationSamples);
sw.Stop();
Console.WriteLine($"{nameof(SequentialMeanHeavyComputation)} sequential tasks: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (CPU threads match count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 100);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Higher thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 4);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Lower thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
SequentialDatasetPopulation(datasetSize);
sw.Stop();
Console.WriteLine($"{nameof(SequentialDatasetPopulation)} sequential data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
dataset.Clear();
sw.Restart();
ParalellDatasetPopulation(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetPopulation)} parallel data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
ParalellDatasetWithLoss(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetWithLoss)} parallel data with loss: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
Console.WriteLine($"Lossless dataset count: {dataset.Count}");
Console.WriteLine($"Dataset with loss: {datasetWithLoss.Count}\n");
datasetWithLoss.Clear();
sw.Restart();
SequentialReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(SequentialReadOnlyDataset)} sequential reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
ParallelReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(ParallelReadOnlyDataset)} parallel reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
Console.Read();
Output:
SequentialMeanHeavyComputation sequential tasks: 12 800,7ms
Available Threads: 15
ParallelMeanHeavyComputation parallel tasks (CPU threads match count): 860,3ms
Available Threads: 99
ParallelMeanHeavyComputation parallel tasks (Higher thread count): 805,0ms
Available Threads: 3
ParallelMeanHeavyComputation parallel tasks (Lower thread count): 3 200,4ms
SequentialDatasetPopulation sequential data population: 9 072,4ms
Available Threads: 15
ParalellDatasetPopulation parallel data population: 23 420,0ms
Available Threads: 15
ParalellDatasetWithLoss parallel data with loss: 6 788,3ms
Lossless dataset count: 100000000
Dataset with loss: 77057456
SequentialReadOnlyDataset sequential reading operations: 20 371,0ms
ParallelReadOnlyDataset parallel reading operations: 3 020,6ms
(Red: 25%, Orange: 56%, Green: 75%, Blue: 100%)
With task parallelism we achieved over 20x performance using 100% of CPU threads. (in this example, not always like that)
In read-only data paralelism with some computation we achieve near 6,5x faster of CPU usage 56% (with fewer computations the difference would be shorter)
But trying to implement a "real parallism" of data for writing our performance is more than twice slower and CPU can't use full potential using only 25% usage due sycronization contexts
Conclusions: Using Parallel.For does not guarantee that your code will run really in parallel neither faster. It requires a previous code/data preparation and deep analysis, benchmarks and tunings
Check also this Microsoft Documentation talking about villains in Parallel Code
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | O. Jones |
| Solution 2 | |
| Solution 3 |
