'Accumulating Doubles Into Bins via intrinsics

I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.

A naive implementation is below:

const int N_OBS = 100`000`000;
const int N_BINS = 16;
double obs[N_OBS];    // Observations
int8_t offsets[N_OBS];
double acc[N_BINS] = {0};

for (int i = 0; i < N_OBS; ++i) {
  acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}

Is this possible using simd/avx intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.

c++simd avx avx2

Solution 1:^[1]

Modern CPUs are surprisingly good running your naïve version. On AMD Zen3, I’m getting 48ms for 100M random numbers on input, that’s 18 GB/sec RAM read bandwidth. That’s like 35% of the hard bandwidth limit on my computer (dual-channel DDR4-3200).

No SIMD gonna help, I’m afraid. Still, the best version I got is the following. Compile with OpenMP support, the switch depends on your C++ compiler.

void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
{
    // Count of OpenMP threads = CPU cores to use
    constexpr int ompThreadsCount = 4;

    // Use independent set of accumulators per thread, otherwise concurrency gonna corrupt data.
    // Aligning by 64 = cache line, we want to assign cache lines to CPU cores, sharing them is extremely expensive
    alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
    memset( &accumulators, 0, sizeof( accumulators ) );

    // Minimize OMP overhead by dispatching very few large tasks
#pragma omp parallel for schedule(static, 1)
    for( int i = 0; i < ompThreadsCount; i++ )
    {
        // Grab a slice of the output buffer
        double* const acc = &accumulators[ i * 16 ];

        // Compute a slice of the source data for this thread
        const size_t first = i * length / ompThreadsCount;
        const size_t last = ( i + 1 ) * length / ompThreadsCount;

        // Accumulate into thread-local portion of the buffer
        for( size_t i = first; i < last; i++ )
        {
            const int8_t idx = indices[ i ];
            acc[ idx ] += rsi[ i ];
        }
    }

    // Reduce 16*N scalars to 16 with a few AVX instructions
    for( int i = 0; i < 16; i += 4 )
    {
        __m256d v = _mm256_load_pd( &accumulators[ i ] );
        for( int j = 1; j < ompThreadsCount; j++ )
        {
            __m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
            v = _mm256_add_pd( v, v2 );
        }
        _mm256_storeu_pd( rdi + i, v );
    }
}

The above version results in 20.5ms time, translates to 88% of RAM bandwidth limit.

P.S. I have no idea why the optimal threads count is 4 here, I have 8 cores/16 threads in the CPU. Both lower and higher values decrease the bandwidth. The constant is probably CPU-specific.

Solution 2:^[2]

If indeed the offsets do not change for thousands (probably even tens) of times, it is likely worthwile to "transpose" them, i.e., to store all indices which need to be added to acc[0], then all indices which need to be added to acc[1], etc.

Essentially, what you are doing originally is a sparse-matrix times dense-vector product with the matrix in compressed-column-storage format (without explicitly storing the 1-values).

As shown in this answer sparse GEMV products are usually faster if the matrix is stored in compressed-row-storage (even without AVX2's gather instruction, you don't need to load and store the accumulated value every time).

Untested example implementation:

using sparse_matrix = std::vector<std::vector<int> >;

// call this once:
sparse_matrix transpose(uint8_t const* offsets, int n_bins, int n_obs){
    sparse_matrix res;
    res.resize(n_bins);

    // count entries for each bin: 
    for(int i=0; i<n_obs; ++i) {
        // assert(offsets[i] < n_bins);
        res[offsets[i]].push_back(i);
    }

    return res;
}

void accumulate(double acc[], sparse_matrix const& indexes, double const* obs){
    for(std::size_t row=0; row<indexes.size(); ++row) {
        double sum = 0;
        for(int col : indexes[row]) {
            // you can manually vectorize this using _mm256_i32gather_pd,
            // but clang/gcc should autovectorize this with -ffast-math -O3 -march=native
            sum += obs[col];
        }
        acc[row] = sum;
    }
}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Soonts
Solution 2	marc_s

'Accumulating Doubles Into Bins via intrinsics

Solution 1:[1]

Solution 2:[2]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]