Optimization of a short-length cyclic convolution

I have two sequences of 8 unsigned bytes and I need to compute their cyclic convolution, which yields 8 unsigned 19-bit integers. As I repeat this millions of times, I want to optimize.

The straightforward way takes 64 MAC operations. I already know how to accelerate this by means of SSE/AVX instructions, and this is not what I am after.

Is there another way, possibly based on the FFT or a number-theoretic transform, to reduce the operation count, or some other technique that would give a speedup?

Actually, I don't need the 8 values: the largest one and the corresponding shift are enough.



Solution 1:[1]

I was researching a related problem, found this question and got curious about it.

Multiplication can be seen as a convolution with extra steps: long multiplication convolves the digit sequences and then propagates the carries (e.g. 12 × 34 convolves (1,2) with (3,4) into (3, 10, 8), and carrying gives 408). It is not hard to embed one into the other if one takes into account proper zero padding to allow for carries and so on.

In this specific problem, you can think of the cyclic convolution of 8 bytes as a multiplication modulo M, i.e. C = A*B mod M, where:

A = a0 + a1 * K + a2 * K^2 + ...
B = b0 + b7 * K + b6 * K^2 + ...
C = c0 + c1 * K + c2 * K^2 + ...
M = K^8 - 1

with K > 8 * 255 * 255 chosen large enough to represent the maximum possible value at each point of the convolution without the carries spilling over into the next value. (Note that B takes the coefficients of b in reverse order; this makes the indices of the product match the convolution indexing used in the code below.)

In practice we will always choose K=2^k so that embedding and extracting the individual values is trivial using bit shifts, hence k>=19.
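To make the embedding concrete, here is a toy sketch with just two values per operand (the helper names and demo values are mine, not part of the real code below): each 19-bit field of the product holds one coefficient of the convolution, and no carries cross between fields.

#include <cinttypes>

static const int K_BITS = 19;                       // K = 2^19

// pack two byte values into 19-bit fields of one integer
uint64_t pack2 (uint8_t x0, uint8_t x1) {
    return uint64_t(x0) | (uint64_t(x1) << K_BITS);
}

// (a0 + a1*K) * (b0 + b1*K) = a0*b0 + (a0*b1 + a1*b0)*K + a1*b1*K^2
void demo () {
    uint64_t P = pack2(200, 100) * pack2(50, 25);
    uint32_t c0 = uint32_t(P            ) & ((1u<<K_BITS)-1); // 200*50          == 10000
    uint32_t c1 = uint32_t(P >>   K_BITS) & ((1u<<K_BITS)-1); // 200*25 + 100*50 == 10000
    uint32_t c2 = uint32_t(P >> 2*K_BITS) & ((1u<<K_BITS)-1); // 100*25          ==  2500
    (void)c0; (void)c1; (void)c2;
}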

There are multiple ways to efficiently implement this arbitrary-precision product on modern hardware. 64-bit integer multiplication on modern CPUs is just as fast, latency- and throughput-wise, as shorter multiplications, so we can try cramming our product into as few 64-bit words as possible, which in this case is N=3, from 64N >= 8k with 19 <= k <= 24.

The naive way to do this uses 9 MULQ (every MULQ being only slightly slower than a regular 64-bit multiplication on modern hardware), while Toom-Cook multiplication uses only 5 MULQ. In theory one can improve on this using Winograd cyclic convolutions, by splitting M into appropriate factors, computing the products modulo those factors and reconstructing the solution. This can in principle reduce the count further, to just 4 or 3 MULQ, but at the cost of significantly more bit-manipulation trickery to avoid costly divisions.

The following code gives the expected results:

#include <iostream>
#include <cinttypes>
#include <cassert>
using namespace std;

static const int NN     = 8; // length of the convolution
static const int LGNN   = 3; // ceil(log2(NN))
static const int PAD    = 2*(8*sizeof(uint8_t)) + LGNN; // == 19
static const int BPW    = (8*sizeof(uint64_t)) / PAD; // == 3 // values per 64-bit word
static const int NW     = (NN+BPW-1)/BPW ; // number of 64-bit words to store the sequence


// cyclic convolution via a 5-point toom-cook product
void cyclic5 (uint32_t c[], const uint8_t a[], const uint8_t b[]) {

    static const uint64_t WMASK = (1ull<<BPW*PAD) -1; //0x01FFFFFFFFFFFFFF;
    static const uint32_t CMASK = (1ul <<    PAD) -1; //0x0007FFFF;

    uint64_t A[NW]={0,0,0}, B[NW]={0,0,0};
    __uint128_t Q[2*NW-1];


    for (int k=0; k<NN; ++k) A[k/BPW] |= uint64_t(a[k])<<((k%BPW)*PAD);
    for (int k=0; k<NN; ++k) B[k/BPW] |= uint64_t(b[(NN-k)%NN])<<((k%BPW)*PAD);

    // 5-point toom-cook
    // 5 mul, 12 add, 6 shift
    {
        // x==0
        Q[0] = (__uint128_t)(A[0])               * (__uint128_t)(B[0]);
        // x==2
        Q[1] = (__uint128_t)(A[0]+2*A[1]+4*A[2]) * (__uint128_t)(B[0]+2*B[1]+4*B[2]);
        // x==1
        Q[2] = (__uint128_t)(A[0]+  A[1]+  A[2]) * (__uint128_t)(B[0]+  B[1]+  B[2]);
        // x==1/2
        Q[3] = (__uint128_t)(A[2]+2*A[1]+4*A[0]) * (__uint128_t)(B[2]+2*B[1]+4*B[0]);
        // x==infinity
        Q[4] = (__uint128_t)(A[2])               * (__uint128_t)(B[2]);
    }

    // recover the partial products by polynomial interpolation
    {
        // 6 add, 4 shift
        Q[1] -= Q[0]; Q[1] -= 16*Q[4]; Q[1]>>=1;
        Q[3] -= Q[4]; Q[3] -= 16*Q[0]; Q[3]>>=1;
        Q[2] -= Q[0]; Q[2] -=    Q[4];

        // 5 add, 2 shift
        // 2 add, 2 shift, 1 div
        {
            Q[2]  = 5*Q[2]  - Q[1] - Q[3];
            Q[1] -= 2*Q[2];
            Q[3] -= 2*Q[2];

            __uint128_t Q1 = (4*Q[3]-Q[1])/15; //  division by 15 can be replaced by a few additions and shifts
            __uint128_t Q3 = Q[3] - 4*Q1;

            Q[1] = Q1;
            Q[3] = Q3;
        }

    }

    uint64_t P[2*NW-1 +1];

    for (int w=0; w<2*NW-1; ++w) P[w  ]  = (uint64_t)(Q[w]) & WMASK;
    for (int w=0; w<2*NW-2; ++w) P[w+1] += (uint64_t)(Q[w] >> (BPW*PAD));


    // wrap around and extract
    for (int k=0; k<2*NN-1; ++k) {
        uint32_t v = (P[k/BPW] >> (k%BPW)*PAD) & CMASK;
        if (k<NN) c[k]     = v;
        else      c[k%NN] += v;
    }
}
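
// Side note (my addition, not part of the original answer): since the
// numerator 4*Q[3]-Q[1] above is exactly 15 times the wanted coefficient,
// the division by 15 can also be done as a single multiplication by the
// inverse of 15 modulo 2^128, instead of the shifts-and-adds variant
// mentioned in the comment. The inverse of 15 mod 2^(4m) is 0xEE...EEF
// (check: 15 * 0xEF = 0xE01 == 1 mod 2^8).
static inline __uint128_t div15_exact (__uint128_t x) {
    const __uint128_t INV15 =
        ((__uint128_t)0xEEEEEEEEEEEEEEEEull << 64) | 0xEEEEEEEEEEEEEEEFull;
    return x * INV15; // wraps mod 2^128; exact whenever 15 divides x
}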


// "naive" 3x3-term multiplication
void cyclic9 (uint32_t c[], const uint8_t a[], const uint8_t b[]) {

    static const uint64_t WMASK = (1ull<<BPW*PAD) -1; //0x01FFFFFFFFFFFFFF;
    static const uint32_t CMASK = (1ul <<    PAD) -1; //0x0007FFFF;

    uint64_t A[NW]={0,0,0}, B[NW]={0,0,0};
    uint64_t P[2*NW-1 +1] = {0,0,0,0,0,0};

    for (int k=0; k<NN; ++k) A[k/BPW] |= uint64_t(a[k])<<((k%BPW)*PAD);
    for (int k=0; k<NN; ++k) B[k/BPW] |= uint64_t(b[(NN-k)%NN])<<((k%BPW)*PAD);

    // linear convolution aka product, with the standard multiplication algorithm
    for (int i=0; i<NW; ++i)
        for (int j=0; j<NW; ++j) {
            // will compile to a single MULQ in x86-64 assembly
            __uint128_t Q = (__uint128_t)A[i] * (__uint128_t)B[j];

            P[i+j  ] += (uint64_t)(Q) & WMASK;
            P[i+j+1] += (uint64_t)(Q >> (BPW*PAD));
        }

    // wrap around and extract
    for (int k=0; k<2*NN-1; ++k) {
        uint32_t v = (P[k/BPW] >> (k%BPW)*PAD) & CMASK;
        if (k<NN) c[k]     = v;
        else      c[k%NN] += v;
    }
}

// baseline: 64 16-bit MAC operations
void cyclic64 (uint32_t c[], const uint8_t a[], const uint8_t b[]) {

    for (int k=0; k<NN; ++k) c[k] = 0;

    // cyclic convolution, computed directly
    for (int i=0; i<NN; ++i)
        for (int j=0; j<NN; ++j)
            c[(NN+i-j)%NN] += uint16_t(a[i])*uint16_t(b[j]);
}

int main() {

    assert (PAD==19);
    assert (BPW==3);
    assert (NW==3);       

    uint8_t B[] = {0x92, 0x16, 0x5e, 0xf1, 0x29, 0xe9, 0xbb, 0xcd, 0x8e, 0xd2, 0xa8, 0xb8, 0xae, 0xc1, 0x4a, 0x60};

    uint32_t R[NN];

    cyclic64(R, B, B+NN);
    for (int i=0; i<NN; ++i) cout << R[i] << "  "; cout << endl;

    cyclic9(R, B, B+NN);
    for (int i=0; i<NN; ++i) cout << R[i] << "  "; cout << endl;

    cyclic5(R, B, B+NN);
    for (int i=0; i<NN; ++i) cout << R[i] << "  "; cout << endl;

    return 0;
}
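
Since the question only needs the largest output and the corresponding shift, a small reduction on top of any of these routines finishes the job; a minimal sketch (the function name is mine):

#include <algorithm>

// reduce the 8 outputs to the pair the question actually asks for:
// the largest value and the shift (index) at which it occurs
void max_and_shift (const uint32_t c[8], uint32_t &best, int &shift) {
    const uint32_t *m = std::max_element(c, c + 8);
    best  = *m;
    shift = int(m - c);
}

For instance, max_and_shift(R, best, shift) after any of the calls in main above.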

Solution 2:[2]

The cyclic convolution can be calculated by taking the Discrete Fourier Transform (DFT) of each input, multiplying the transforms, and taking the inverse DFT. Using a Fast Fourier Transform (FFT) algorithm, the DFT and its inverse can each be calculated in N*log(N) operations, and then another N ops multiply the transforms. So roughly speaking you need 3N*log(N)+N operations, which works out to 80 for your input size of 8. And the operations in the FFT method are complex-number operations, not just MACs.
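For illustration, here is a minimal sketch of that pipeline, with deliberately naive O(N^2) transforms (a real implementation would use an FFT library; the names are mine). It computes the textbook cyclic convolution; the question's routines effectively reverse b first, which here would amount to using conj(B[k]).

#include <complex>
#include <cstdint>
#include <cmath>

// cyclic convolution via DFT: transform both inputs, multiply pointwise,
// inverse-transform, scale by 1/N; the results are exact small integers
typedef std::complex<double> cplx;
static const int    N8 = 8;
static const double PI = 3.14159265358979323846;

static void dft (cplx out[], const cplx in[], int sign) {
    for (int k = 0; k < N8; ++k) {
        out[k] = 0;
        for (int n = 0; n < N8; ++n)
            out[k] += in[n] * std::polar(1.0, sign * 2*PI * k * n / N8);
    }
}

void cyclic_dft (uint32_t c[], const uint8_t a[], const uint8_t b[]) {
    cplx ta[N8], tb[N8], A[N8], B[N8], C[N8], t[N8];
    for (int i = 0; i < N8; ++i) { ta[i] = a[i]; tb[i] = b[i]; }
    dft(A, ta, -1);
    dft(B, tb, -1);
    for (int k = 0; k < N8; ++k) C[k] = A[k] * B[k];   // convolution theorem
    dft(t, C, +1);
    for (int i = 0; i < N8; ++i)
        c[i] = uint32_t(std::llround(t[i].real() / N8));
}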

However, there's one more optimization: since the input data are real, you can represent the transform in N/2+1 complex points without loss of information. There are real-valued transforms (and inverse transforms) that take advantage of this property. As a rule of thumb, this is equivalent to doing a transform that is half as long. So if we plug 4 into 3N*log(N)+N, we get 28. Now we need to consider the complex number issue: a complex multiply is two multiplies and an add for each of the real and imaginary components. So each complex op is about equivalent to 3 MACs, and we see that this is still slower than direct convolution.

The FFT approach does start paying off as the data sizes get bigger. If you were working with 2048-long inputs, the number of operations would be 3*10240 + 1024, roughly 32k. Even multiplied by 3 for the complex-number overhead, this compares very favorably to the ~4M operations of the direct implementation.

Another case in which the FFT approach is worth considering is if you need to convolve one array against many others, or all against all. In that case you can calculate the input transforms once and reuse them. For K sequences, if you need to do all K^2 cross-convolutions, you could perform K transforms, K^2 complex array multiplies, and K^2 inverse transforms. For 10 arrays of input size 8, that's less than 1500 complex number operations (10*4*log(4) + 500 + 100*4*log(4) for inputs, transform multiplies, and outputs). Doing the direct approach would require 100*64 MACs, so the FFT approach wins out.
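Reusing the naive dft helper from the sketch above, the caching pattern looks like this (again a sketch; the flat output layout is illustrative):

#include <vector>
#include <array>

// all-pairs cyclic convolutions with cached transforms: K forward
// transforms up front, then one pointwise multiply + one inverse
// transform per pair, instead of three transforms per pair
void all_pairs (uint32_t out[][N8], const uint8_t in[][N8], int K) {
    std::vector< std::array<cplx, N8> > T(K);
    cplx tmp[N8];
    for (int s = 0; s < K; ++s) {                     // K forward transforms
        for (int i = 0; i < N8; ++i) tmp[i] = in[s][i];
        dft(T[s].data(), tmp, -1);
    }
    for (int u = 0; u < K; ++u)                       // K^2 pairs
        for (int v = 0; v < K; ++v) {
            cplx C[N8], t[N8];
            for (int k = 0; k < N8; ++k) C[k] = T[u][k] * T[v][k];
            dft(t, C, +1);
            for (int i = 0; i < N8; ++i)
                out[u*K + v][i] = uint32_t(std::llround(t[i].real() / N8));
        }
}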

But for your case of pairs, a good direct implementation seems like it will be the hands-down winner.

Solution 3:[3]

In "Fast Fourier Transform and Convolution Algorithms", Nussbaumer reports an optimized method to compute a convolution of 8 terms using 14 multiplications and 46 additions (on reals). I doubt that better can be done using standard arithmetic.

I have the feeling that the Fermat/Euler-number transform is relevant, but I couldn't fill out the details.
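One detail worth noting: a length-8 transform modulo the Fermat prime 2^16+1 = 65537 cannot hold the 19-bit outputs directly, so a larger modulus (or a CRT combination) would be needed. Below is a minimal sketch of the general number-theoretic route using the larger NTT-friendly prime 998244353 = 119*2^23 + 1 (primitive root 3), with deliberately naive O(N^2) transforms just to show the exact arithmetic; this is an illustration of the principle, not the Fermat-number transform itself.

#include <cstdint>

// cyclic convolution via a number-theoretic transform: same structure as
// the DFT method, but with roots of unity modulo a prime, so everything
// is exact integer arithmetic with no rounding. 19-bit outputs fit below p.
static const uint64_t PRIME = 998244353;  // 119 * 2^23 + 1
static const int      NNTT  = 8;

static uint64_t pow_mod (uint64_t b, uint64_t e) {
    uint64_t r = 1;
    for (b %= PRIME; e; e >>= 1, b = b*b % PRIME)
        if (e & 1) r = r*b % PRIME;
    return r;
}

void cyclic_ntt (uint32_t c[], const uint8_t a[], const uint8_t b[]) {
    const uint64_t w    = pow_mod(3, (PRIME-1)/NNTT); // primitive 8th root of unity
    const uint64_t winv = pow_mod(w, PRIME-2);        // inverse root (Fermat)
    const uint64_t ninv = pow_mod(NNTT, PRIME-2);     // 1/8 mod p

    uint64_t A[NNTT], B[NNTT];
    for (int k = 0; k < NNTT; ++k) {                  // forward transforms
        A[k] = B[k] = 0;
        for (int n = 0; n < NNTT; ++n) {
            uint64_t wkn = pow_mod(w, (uint64_t)k*n);
            A[k] = (A[k] + a[n] * wkn) % PRIME;
            B[k] = (B[k] + b[n] * wkn) % PRIME;
        }
    }
    for (int n = 0; n < NNTT; ++n) {                  // pointwise product + inverse
        uint64_t s = 0;
        for (int k = 0; k < NNTT; ++k)
            s = (s + (A[k] * B[k] % PRIME) * pow_mod(winv, (uint64_t)n*k)) % PRIME;
        c[n] = uint32_t(s * ninv % PRIME);
    }
}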

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1
[2] Solution 2: mtrw
[3] Solution 3