Category "sse"

Is it possible to popcount __m256i and store result in 8 32-bit words instead of the 4 64-bit using Wojciech Mula algorithm's?

I have recently discovered that AVX2 doesn't have a popcount for __m256i and the only way I found to do something similar is to follow the Wojciech Mula algori

Efficiently shift-or large bit vector

I have large in-memory array as some pointer uint64_t * arr (plus size), which represents plain bits. I need to very efficiently (most performant/fast) shift th

Does gcc use Intel's SSE 4.2 instructions for text processing if available?

I read here that Intel introduced SSE 4.2 instructions for accelerating string processing. Quote from the article: The SSE 4.2 instruction set, first implement

SIMD intrinsic and memory bus size - How CPU fetches all 128/256 bits in a single memory read?

Hello Forum – I have a few similar/related questions about SIMD intrinsic for which I searched online including stackoverflow but did not find good answer

Fastest Implementation of the Natural Exponential Function Using SSE

I'm looking for an approximation of the natural exponential function operating on SSE element. Namely - __m128 exp( __m128 x ). I have an implementation whic