Tasks run in threads take longer than in serial?

So I'm doing some computation on 4 million nodes.

The very basic serial version just has a for loop that iterates 4 million times and does the computation each time. This takes roughly 1.2 seconds.

When I split the for loop into, say, 4 for loops, each doing 1/4 of the computation, the total time becomes 1.9 seconds.

I guess there is some overhead in creating the for loops, and maybe it has to do with the CPU preferring to process data in chunks.

The thing that really bothers me is that when I put the 4 loops on 4 threads on an 8-core machine, each thread takes 0.9 seconds to finish. I was expecting each of them to take only 1.9/4 seconds.

I don't think there is any race condition or synchronization issue, since all I do is run a for loop to create the 4 threads (which takes 200 microseconds) and then a for loop to join them.

The computation reads from a shared array and writes to a different shared array. I am sure they are not writing to the same byte.

Where could the overhead come from?

main: ncores is the number of cores; node_size is the size of the graph (4 million nodes).

        for (i = 0; i < ncores; i++) {
            int *t = (int*)malloc(sizeof(int));  /* per-thread index, heap-allocated */
            *t = i;
            int iret = pthread_create(&thread[i], NULL, calculate_rank_pthread, (void*)t);
        }
        for (i = 0; i < ncores; i++)
        {
            pthread_join(thread[i], NULL);
        }

calculate_rank_pthread: vector is the rank vector for the PageRank calculation.

void *calculate_rank_pthread(void *argument) {
    int i;                                /* loop index must be thread-local, not global */
    int index = *(int*)argument;
    free(argument);                       /* allocated in main's create loop */
    for (i = index; i < node_size; i += ncores)
        current_vector[i] = calc_r(i, vector);
    return NULL;
}

calc_r: this is just the PageRank calculation using compressed row storage (CSR) format.

double calc_r(int i, double *vector){
    double prank = 0;
    int j;
    for(j = row_ptr[i]; j < row_ptr[i+1]; j++){
        prank += vector[col_ind[j]] * val[j];
    }
    return prank;
}

Everything not declared locally is a global variable.



Solution 1:[1]

The computation read from a shared array and write to a different shared array. I am sure they are not writing to the same byte.

It's impossible to be sure without seeing the relevant code and some more details, but this sounds like it could be due to false sharing, or ...

the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.

http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206

UPDATE

This looks like it could very well trigger false sharing, depending on the size of vector (though there is still not enough information in the post to be sure, as we don't see how the various vectors are allocated).

for(i = index; i < node_size ; i+=ncores) 

Instead of interleaving which core works on which data with i += ncores, give each of them a contiguous range of data to work on.

Solution 2:[2]

For me, the same surprise when building and running in Debug (with different test code, though).

In Release, everything is as expected ;)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
Solution 1
Solution 2: Goran Shekerov