Category "avx"

Is it possible to popcount a __m256i and store the result in 8 32-bit words instead of 4 64-bit words using Wojciech Mula's algorithm?

I have recently discovered that AVX2 doesn't have a popcount for __m256i and the only way I found to do something similar is to follow the Wojciech Mula algori

Efficiently shift-or large bit vector

I have a large in-memory array, given as a pointer uint64_t * arr (plus a size), which represents plain bits. I need to very efficiently (most performant/fast) shift th

Bit-twiddling Wizardry for Index of Min or Max Element in XMM/YMM/ZMM

Is there an instruction or efficient branchless sequence of instructions to figure out the INDEX of (not the value of) the largest (or smallest) element of an u

Implementing matrix operation using AVX in C

I'm trying to implement the following operation using AVX: for (i=0; i<N; i++) { for(j=0; j<N; j++) { for (k=0; k<K; k++) { d[i][j] += 2 *