'Cubic root function cbrt() in Visual Studio 2012

I am writing a program in Visual Studio 2012 Professional (Windows) in C/C++ which consists of calculating many powers using pow(). I ran the profiler to find out why it takes such a long time to run and I found that pow() is the bottleneck.

I have rewritten the powers such as

pow(x,1.5) to x*sqrt(x)

and

pow(x,1.75) to sqrt(x*x*x*sqrt(x))

which significantly improved the speed of the program.

A few powers are of the kind pow(x,1.0/3.0) so I looked for the cubic root function cbrt() to speed up things but it seems not available in Visual Studio which I can hardly imagine so therefore my question:

Where can I find the cbrt() function in Visual Studio 2012 Professional and if not, what are the alternatives except for pow(x,1.0/3.0)?

Kind regards,

Ernst Jan



Solution 1:[1]

Implementation below is 4x faster than std::pow with relatively higher tolerance (0.000001) on an AVX-512 CPU. It is made of vertically auto-vectorizable loops for every basic operation like multiplication and division so that it computes 8,16,32 elements at once instead of horizontally vectorizing the Newton-Raphson loop.

#include <cmath> 
/* 
   Newton-Raphson iterative solution
   f_err(x) = x*x*x - N
   f'_err(x) = 3*x*x
   x = x - (x*x*x - N)/(3*x*x)
   x = x - (x - N/(x*x))/3   <--- repeat until error < tolerance
   but with vertical-parallelization 
*/
template < typename Type, int Simd, int inverseTolerance> 
    inline 
void cubeRootFast(Type *const __restrict__ data, 
        Type *const __restrict__ result) noexcept 
{ 
    // alignment 64 required for AVX512 vectorization
    alignas(64) 
    Type xd[Simd]; 

    alignas(64) 
    Type resultData[Simd]; 

    alignas(64) 
    Type xSqr[Simd]; 

    alignas(64) 
    Type nDivXsqr[Simd]; 

    alignas(64) 
    Type diff[Simd]; 

    // cube root checking mask
    for (int i = 0; i < Simd; i++) 
    { 
        xd[i] = data[i] <= Type(0.000001);
    } 

    // skips division by zero if input is zero or close to zero
    for (int i = 0; i < Simd; i++) 
    { 
        resultData[i] = xd[i] ? Type(1.0) : data[i]; 
    } 

    // Newton-Raphson Iterations in parallel 
    bool work = true; 
    while (work) 
    { 
        // compute x*x
        for (int i = 0; i < Simd; i++) 
        { 
            xSqr[i] = resultData[i] *resultData[i]; 
        } 

        // compute N/(x*x)
        for (int i = 0; i < Simd; i++) 
        { 
            nDivXsqr[i] = data[i] / xSqr[i]; 
        } 

        // compute x - N/(x*x)
        for (int i = 0; i < Simd; i++) 
        { 
            nDivXsqr[i] = resultData[i] - nDivXsqr[i]; 
        } 

        // compute (x-N/(x*x))/3
        for (int i = 0; i < Simd; i++) 
        { 
            nDivXsqr[i] = nDivXsqr[i] / Type(3.0); 
        } 

        // compute x - (x-N/(x*x))/3
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = resultData[i] - nDivXsqr[i]; 
        } 

        // compute error
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = resultData[i] - diff[i]; 
        } 

        // compute absolute error
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = std::abs(diff[i]); 
        } 

        // compute condition to stop looping (error < tolerance)?
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = diff[i] > Type(1.0/inverseTolerance); 
        } 

        // all SIMD lanes have to have zero work left to end
        Type check = 0; 
        for (int i = 0; i < Simd; i++) 
        { 
            check += diff[i]; 
        } 
        work = (check > Type(0.0)); 

        // compute the next x guess
        for (int i = 0; i < Simd; i++) 
        { 
            resultData[i] = resultData[i] - nDivXsqr[i]; 
        } 
    } 

    // if input was close to zero, output zero
    // output result otherwise
    for (int i = 0; i < Simd; i++) 
    { 
        result[i] = xd[i] ? Type(0.0) : resultData[i]; 
    } 
} 
#include <iostream> 
 
int main() 
{ 
    constexpr int n = 8192; 
    constexpr int simd = 16; 
    constexpr int inverseTolerance = 1000;
    float data[n]; 
    for (int i = 0; i < n; i++) 
    { 
        data[i] = i; 
    } 
    for (int i = 0; i < n; i += simd) 
    { 
        cubeRootFast<float, simd, inverseTolerance> (data + i, data + i); 
    } 
    for (int i = 0; i < 10; i++) 
        std::cout << data[i *i *i] << std::endl; 
    return 0; 
} 

It is tested only with GCC so it may require extra MSVC-pragmas on each loop to force auto-vectorization. If you have OpenMP, then you can also use #pragma omp simd safelen(Simd) to achieve same thing.

The performance only holds within [0,1] range. To use bigger values, you should use range reduction like this:

// example: max value is 1000
for(auto & input:inputs)
    input = input/1000.0f // normalize

for(..)
    cubeRootFast<float, simd, inverseTolerance> (input + i, input + i)

 for(auto & input:inputs)
    input = 10.0f*input // de-normalize  (1000 = 10 x 10 x 10) 

If you need only 0.005 error on low-range like [0,1000] with 16x speedup, you can try below implementation that uses polynomial approximation (Horner-Scheme is applied to compute with FMA instructions and no explicit auto-vectorization required since it doesn't include any branching/loop inside):

// optimized for [0,1] range: ~1 cycles on AVX512, 0.003 average error 
// polynomial approximation with Horner Scheme for FMA optimization 
template<typename T> 
T cubeRootFast(T x) 
{ 
 T xd = x-T(1.0); 
 T result = T(-55913.0/4782969.0); 
 result *= xd; 
 result += T(21505.0/1594323.0);  
 result *= xd; 
 result += T(-935.0/59049.0);  
 result *= xd; 
 result += T(374.0/19683.0);  
 result *= xd; 
 result += T(-154.0/6561.0);  
 result *= xd; 
 result += T(22.0/729.0);  
 result *= xd; 
 result += T(-10.0/243.0); 
 result *= xd; 
 result += T(5.0/81.0); 
 result *= xd; 
 result += T(-1.0/9.0);  
 result *= xd; 
 result += T(1.0/3.0); 
 result *= xd; 
 result += T(1.0); 
 return result; 
} 
 
// range reduction + dereduction: ~ 1 cycles on AVX512 
 for(int i=0;i<8192;i++) 
 {  
     float inp = input[i]; 
     // scaling + descaling for range [1,999] 
     float scaling = (inp>333.0f)?(1000.0f):(333.0f); 
     scaling = (inp>103.0f)?scaling:(103.0f); 
     scaling = (inp>29.0f)?scaling:(29.0f); 
     scaling = (inp>7.0f)?scaling:(7.0f); 
     scaling = (inp>3.0f)?scaling:(3.0f); 
      
     output[i] = powf(scaling,0.33333333333f)*cubeRootFast<float>(inp/scaling);  
 } 

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1