'Conditionally and efficiently locking on array of mutexes in C++

I have a simple work that is parallelized with OpenMP with very low scaling (2x speedup with 8 cores) due to too many locking:

struct MutexWithoutFalseSharing
{
    std::mutex mut;
    char padding[(64-sizeof(std::mutex))>0?(64-sizeof(std::mutex)):64];
};
MutexWithoutFalseSharing mutArr[256];

std::unordered_set<int> mapping[N];

// takes 4 milliseconds with omp, 8 milliseconds without omp 
// (8cores)
// (result is vector of std::pair of ints, 50k-100k elements)
#pragma omp parallel for
for(auto& r:result)
{
    std::lock_guard<std::mutex> lg(mutArr[r.first&255].mut);
    mapping[(r.first].insert(r.second);
}

what it does is simply taking an id (r.first) and another id (r.second) from same std::pair and inserting them into an array of sets.

The std::vector of std::pair (result) contains data like this pattern:

 r.first = {0,0,0,5,5,15,15,15,15,1000,1000,1270,4000,4000,4000,4000}

so that the locking algorithm picks same lock repeatedly, causing unnecessary extra latency. To evade the unnecessary extra locking, I'm thinking of this:

int lastId = -1000;
// assuming subResult is sub-set of original result (per-thread)
for(auto & r:subResult)
{
    if(lastId != r.first)
    {
        mutexArray[lastId].unlock();
        mutexArray[r.first].lock();
        lastId=r.first;
    }
    do_something_that_uses(r.first);
}

but this is not very exception-friendly nor undefined-behavior/deadlock-free. For example, second thread may run its own iteration before the first thread and unlock is wasted (because the first thread did not lock it).

Is there a sane/RAII-friendly way of doing this, without explicit locking? If there is none, how can I check if the selected lock (from array of locks) was locked before calling unlock?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source