Category "avx2"

Accumulating Doubles Into Bins via intrinsics

I have a vector of observations and an equal-length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all ob

AVX2 code cannot be faster than gcc base optimization

I am studying AVX by writing AVX code with inline assembly. In this case, I tried to implement AVX in a simple function. The function name I made is lower_all_c

Is it possible to popcount __m256i and store the result in 8 32-bit words instead of 4 64-bit words using Wojciech Mula's algorithm?

I have recently discovered that AVX2 doesn't have a popcount for __m256i and the only way I found to do something similar is to follow the Wojciech Mula algori