Interleave bits efficiently
I need to make a uint64_t out of two uint32_t values by interleaving their bits: if A=a0a1a2...a31 and B=b0b1...b31, I need C=a0b0a1b1...a31b31. Is there a way to do this efficiently? So far I've only got the naive approach with a for loop of 32 iterations, where each iteration does C |= ((uint64_t)(A & (1u << i)) << i) | ((uint64_t)(B & (1u << i)) << (i + 1)) (the casts to uint64_t are needed so the shifted bits aren't lost in 32-bit arithmetic for i >= 16).
I guess there should be some mathematical trick, like multiplying A and B by some special number that interleaves their bits with zeros in the resulting 64-bit number, so that all that remains is to OR the two products together. But I can't find such a multiplier.
Another potential way to go is a compiler intrinsic or assembly instruction, but I don't know of such.
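There is a well-known trick along these lines that uses shifts and masks rather than a single multiplier: spread each 32-bit input so its bits occupy every other position of a 64-bit value, then OR the two spreads together. The masks are the classic bit-twiddling constants. This is a sketch; spread32 and interleave32 are names chosen here, not from the question, and here a's bits land in the odd positions and b's in the even (swap the operands for the opposite convention):

```c
#include <stdint.h>

// Spread the 32 bits of x into the even bit positions of a 64-bit
// value, doubling the gap between occupied bits at each step
static uint64_t spread32(uint32_t x) {
    uint64_t v = x;
    v = (v | (v << 16)) & 0x0000FFFF0000FFFFULL;
    v = (v | (v << 8))  & 0x00FF00FF00FF00FFULL;
    v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    v = (v | (v << 2))  & 0x3333333333333333ULL;
    v = (v | (v << 1))  & 0x5555555555555555ULL;
    return v;
}

uint64_t interleave32(uint32_t a, uint32_t b) {
    // a's bits go to the odd positions, b's to the even
    return (spread32(a) << 1) | spread32(b);
}
```

This is branchless, uses no memory, and needs no special instruction-set support, so it is a common baseline to compare the intrinsic-based approaches below against.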
Solution 1:[1]
For larger integers, it's worth mentioning the clmul x86 extension for finite field multiplication (carryless multiplication). Interleaving an integer with zeros is equivalent to a carryless multiplication of the integer with itself, which is a single ALU instruction.
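To see why squaring spreads the bits: in the carryless product of x with itself, each cross term x_i·x_j with i ≠ j appears twice and cancels under XOR, leaving only the terms x_i at bit position 2i. A portable software model of carryless multiplication (clmul_sw is a name chosen here, not the hardware instruction) makes this easy to check:

```c
#include <stdint.h>

// Software model of carryless multiplication: XOR together the
// shifted partial products instead of adding them with carries
uint64_t clmul_sw(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if (b & (1u << i))
            r ^= (uint64_t)a << i;
    return r;
}
```

Carrylessly squaring 0xFF, for example, yields 0x5555: eight set bits spread into the even positions of a 16-bit result.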
Solution 2:[2]
Would a short, precalculated array lookup count as a "mathematical trick"?
Precalculate an array of 256 uint16_ts:
static const uint16_t lookup[256]={0x0000, 0x0001, 0x0004, 0x0005, ..., 0x5555};
We can interleave two eight-bit values, and come up with a 16 bit value easily:
uint16_t interleave(uint8_t a, uint8_t b)
{
    return (lookup[a] << 1) | lookup[b];
}
How to extend this to interleave two 32-bit values into a 64-bit value should be obvious: call this four times, once for each of the four bytes that make up a uint32_t, then << and | the results together. Bribe the compiler to inline the whole thing, and the end result should be fairly quick and cheap.
Since RAM is cheap these days, you might want to consider a precalculated table of 65536 uint32_ts, also.
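For completeness, this is one way the "obvious" extension might look, with the 256-entry table built at startup rather than written out by hand. The names init_lookup, interleave8, and interleave32_lut are chosen here for illustration:

```c
#include <stdint.h>

static uint16_t lookup[256];

// Fill the table: lookup[n] spreads the 8 bits of n into the even
// bits of a 16-bit value (0 -> 0x0000, 1 -> 0x0001, 2 -> 0x0004, ...)
static void init_lookup(void) {
    for (int n = 0; n < 256; n++) {
        uint16_t v = 0;
        for (int i = 0; i < 8; i++)
            if (n & (1 << i))
                v |= (uint16_t)(1u << (2 * i));
        lookup[n] = v;
    }
}

static uint16_t interleave8(uint8_t a, uint8_t b) {
    return (uint16_t)((lookup[a] << 1) | lookup[b]);
}

// One table lookup pair per byte, shifted into place
uint64_t interleave32_lut(uint32_t a, uint32_t b) {
    return ((uint64_t)interleave8((uint8_t)(a >> 24), (uint8_t)(b >> 24)) << 48)
         | ((uint64_t)interleave8((uint8_t)(a >> 16), (uint8_t)(b >> 16)) << 32)
         | ((uint64_t)interleave8((uint8_t)(a >> 8),  (uint8_t)(b >> 8))  << 16)
         |  (uint64_t)interleave8((uint8_t)a,         (uint8_t)b);
}
```

The 65536-entry variant mentioned above would replace interleave8 with a single uint32_t lookup per 16-bit pair, halving the number of table accesses at the cost of 256 KiB of table.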
Solution 3:[3]
To make saolof's answer concrete, the following is an implementation using the CLMUL instruction set, interleaving two pairs of uint32_ts per call:
#include <immintrin.h>
#include <stdint.h>
typedef struct {
uint32_t x;
uint32_t y;
} uint32_2;
static inline void interleave_clmul(uint32_2 *input, uint64_t *out) {
    // Load two pairs; _mm_load_si128 requires 16-byte-aligned input
    // (use _mm_loadu_si128 if alignment can't be guaranteed)
    __m128i xy = _mm_load_si128((const __m128i *)input);
    // Rearrange [x0, y0, x1, y1] into [x0, x1, y0, y1]
    xy = _mm_shuffle_epi32(xy, 0b11011000);
    // Two carryless multiplies: carrylessly squaring a value spreads
    // its bits apart with zeros
    __m128i p2 = _mm_clmulepi64_si128(xy, xy, 0x11);
    __m128i p1 = _mm_clmulepi64_si128(xy, xy, 0x00);
    // Bitwise interleave the results
    p2 = _mm_slli_epi16(p2, 1);
    __m128i p3 = _mm_or_si128(p1, p2);
    _mm_storeu_si128((__m128i *)out, p3);
}
That compiles down to the following:
interleave_clmul(uint32_2*, unsigned long*):
vpshufd xmm0, xmmword ptr [rdi], 216 # xmm0 = mem[0,2,1,3]
vpclmulqdq xmm1, xmm0, xmm0, 17
vpclmulqdq xmm0, xmm0, xmm0, 0
vpaddw xmm1, xmm1, xmm1
vpor xmm0, xmm0, xmm1
vmovdqu xmmword ptr [rsi], xmm0
ret
As the simplicity would suggest, on my (Haswell) machine this approach is significantly faster than the corresponding implementation with pdep instructions. The timings are highly noisy and sensitive to the size of the arrays involved, but for arrays of 10000 pairs it's a consistent 2.4x speedup (2.27 cycles per pair for pdep, 0.9 cycles per pair for this implementation).
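For reference, the pdep baseline referred to here might look like the following sketch (interleave_pdep is a name chosen for illustration). _pdep_u64 deposits the low bits of its first operand into the bit positions set in the mask, so one deposit into the even bits and one into the odd bits interleaves the two inputs. This requires BMI2 (Haswell or later); the target attribute lets it compile on GCC/Clang without a global -mbmi2 flag:

```c
#include <immintrin.h>
#include <stdint.h>

// pdep-based interleave: scatter a into even bits, b into odd bits
// (swap the masks for the opposite convention). Requires BMI2.
__attribute__((target("bmi2")))
uint64_t interleave_pdep(uint32_t a, uint32_t b) {
    return _pdep_u64(a, 0x5555555555555555ULL)    // a -> even bits
         | _pdep_u64(b, 0xAAAAAAAAAAAAAAAAULL);   // b -> odd bits
}
```

One pdep handles only 32 bits of input per instruction, so a pair costs two pdeps plus an OR, whereas the CLMUL version above processes two whole pairs per call.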
I used the testing harness found on Daniel Lemire's blog.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | saolof |
| Solution 2 | Sam Varshavchik |
| Solution 3 | Ovinus Real |
