Interleave bits efficiently

I need to make a uint64_t out of two uint32_t values by interleaving their bits: if A=a0a1a2...a31 and B=b0b1...b31, I need C=a0b0a1b1...a31b31. Is there a way to do this efficiently? So far I've only got the naive approach: a for loop of 32 iterations, where each iteration does C|=((A&(1<<i))<<i)|((B&(1<<i))<<(i+1)).
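
Spelled out, that naive version looks something like this (with explicit 64-bit casts so the shifts past bit 31 aren't lost):

uint64_t interleave_naive(uint32_t A, uint32_t B)
{
    uint64_t C = 0;
    for (int i = 0; i < 32; i++) {
        // bit i of A goes to position 2*i, bit i of B to position 2*i+1
        C |= ((uint64_t)(A & (1u << i)) << i)
           | ((uint64_t)(B & (1u << i)) << (i + 1));
    }
    return C;
}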

I guess there should be some mathematical trick, like multiplying A and B by some special number that interleaves their bits with zeros in the resulting 64-bit number, so that all that remains is to OR the two products together. But I can't find such a multiplier.

Another possible route is a compiler intrinsic or an assembly instruction, but I don't know of one.



Solution 1:[1]

For larger integers, it's worth mentioning the CLMUL x86 extension for finite-field (carryless) multiplication. Interleaving an integer with zeros is equivalent to a carryless multiplication of the integer with itself, which is a single ALU instruction: over GF(2) the cross terms of the square cancel in pairs, so bit i of the input lands at bit 2i of the product.
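
A minimal sketch of that idea for a single pair (the names spread_clmul and interleave_clmul_scalar are mine; requires the PCLMUL extension, e.g. compile with -mpclmul); Solution 3 below turns this into a packed version that handles two pairs at once:

#include <immintrin.h>
#include <stdint.h>

// Carrylessly multiply v by itself: bit i of v lands at bit 2*i of the
// product, i.e. the 32 input bits are spread across the even positions
// of a 64-bit result.
static inline uint64_t spread_clmul(uint32_t v)
{
    __m128i x = _mm_set_epi64x(0, v);              // v in the low 64 bits
    __m128i p = _mm_clmulepi64_si128(x, x, 0x00);  // carryless v * v
    return (uint64_t)_mm_cvtsi128_si64(p);         // low 64 bits hold the spread value
}

// A's bits in the even positions, B's bits in the odd positions.
static inline uint64_t interleave_clmul_scalar(uint32_t A, uint32_t B)
{
    return spread_clmul(A) | (spread_clmul(B) << 1);
}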

Solution 2:[2]

Would a short, precalculated array lookup count as a "mathematical trick"?

Precalculate an array of 256 uint16_ts:

static const uint16_t lookup[256]={0x0000, 0x0001, 0x0004, 0x0005, ..., 0x5555};

We can interleave two eight-bit values and come up with a 16-bit value easily:

uint16_t interleave(uint8_t a, uint8_t b)
{
    // a ends up in the odd bit positions, b in the even ones;
    // swap the arguments for the opposite order.
    return (lookup[a] << 1) | lookup[b];
}

How to extend this to interleave two 32-bit values into a 64-bit value should be obvious: call this four times, once for each of the four bytes that make up a uint32_t, then << and | the results together. Bribe the compiler to inline the whole thing, and the end result should be fairly quick and cheap.
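
A minimal sketch of that extension (the name interleave32 is mine; it assumes <stdint.h> and the interleave() function above, and keeps the same argument order):

uint64_t interleave32(uint32_t a, uint32_t b)
{
    return ((uint64_t)interleave(a >> 24, b >> 24) << 48)
         | ((uint64_t)interleave(a >> 16, b >> 16) << 32)
         | ((uint64_t)interleave(a >>  8, b >>  8) << 16)
         |  (uint64_t)interleave(a, b);
}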

Since RAM is cheap these days, you might want to consider a precalculated table of 65536 uint32_ts, also.
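
If writing either table out by hand is unappealing, the same contents can be generated once at startup; a sketch for the 256-entry version (non-const variant, names are mine):

static uint16_t lookup8[256];   // same contents as lookup[] above

static void init_lookup8(void)
{
    for (unsigned v = 0; v < 256; v++) {
        uint16_t spread = 0;
        for (unsigned bit = 0; bit < 8; bit++)
            spread |= (uint16_t)(((v >> bit) & 1u) << (2 * bit));
        lookup8[v] = spread;
    }
}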

Solution 3:[3]

To make saolof's answer concrete, the following is an implementation using the CLMUL instruction set, interleaving two pairs of uint32_ts per call:

#include <immintrin.h>
#include <stdint.h>

typedef struct {
  uint32_t x;
  uint32_t y;
} uint32_2;

static inline void interleave_clmul(uint32_2 *input, uint64_t *out) {
  // Note: _mm_load_si128 expects `input` to be 16-byte aligned;
  // use _mm_loadu_si128 if that cannot be guaranteed.
  __m128i xy = _mm_load_si128((const __m128i *)input);

  // Rearrange the 32-bit lanes from [x0, y0, x1, y1] to [x0, x1, y0, y1],
  // so the x values occupy the low 64 bits and the y values the high 64 bits
  xy = _mm_shuffle_epi32(xy, 0b11011000);

  // Two carryless multiplies: squaring a 64-bit lane carrylessly moves
  // each of its bits from position i to position 2*i
  __m128i p2 = _mm_clmulepi64_si128(xy, xy, 0x11);  // spreads y0 and y1
  __m128i p1 = _mm_clmulepi64_si128(xy, xy, 0x00);  // spreads x0 and x1

  // Bitwise interleave the results: shift the y bits into the odd
  // positions and OR them with the x bits in the even positions
  p2 = _mm_slli_epi16(p2, 1);
  __m128i p3 = _mm_or_si128(p1, p2);

  _mm_storeu_si128((__m128i *)out, p3);
}

That compiles down to the following:

interleave_clmul(uint32_2*, unsigned long*):
        vpshufd         xmm0, xmmword ptr [rdi], 216    # xmm0 = mem[0,2,1,3]
        vpclmulqdq      xmm1, xmm0, xmm0, 17
        vpclmulqdq      xmm0, xmm0, xmm0, 0
        vpaddw          xmm1, xmm1, xmm1
        vpor            xmm0, xmm0, xmm1
        vmovdqu         xmmword ptr [rsi], xmm0
        ret

As the simplicity would suggest, on my (Haswell) machine this version is significantly faster than the corresponding implementation built on pdep instructions. The timings are noisy and sensitive to the size of the arrays involved, but for arrays of 10000 pairs it's a consistent 2.4x speedup (2.27 cycles per pair for pdep, 0.9 cycles per pair for this implementation).
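
For reference, the pdep implementation being compared against is not shown in the answer; a typical version, assuming BMI2 and the _pdep_u64 intrinsic (compile with -mbmi2), looks roughly like this:

#include <immintrin.h>
#include <stdint.h>

// Deposit the bits of a into the even positions and the bits of b into
// the odd positions of a 64-bit result.
static inline uint64_t interleave_pdep(uint32_t a, uint32_t b) {
  return _pdep_u64(a, 0x5555555555555555ULL)
       | _pdep_u64(b, 0xAAAAAAAAAAAAAAAAULL);
}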

I used the testing harness found on Daniel Lemire's blog.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: saolof
Solution 2: Sam Varshavchik
Solution 3: Ovinus Real