'Algorithm for pow(float, float)

I need an efficent algorithm to do math::power function between two floats, do you have any idea how to do this, (i need the algorithm not to use the function itself)



Solution 1:[1]

The general algorithm tends to be computing the float power as the combination of the integer power and the remaining root. The integer power is fairly straightforward, the root can be computed using either Newton - Raphson method or Taylor series. IIRC numerical recipes in C has some text on this. There are other (potentially better) methods for doing this too, but this would make a reasonable starting point for what is a surprisingly complex problem to implement. Note also that some implementations use lookup tables and a number of tricks to reduce the computation required.

Solution 2:[2]

Since IEEE-754 binary floating-point numbers are fractions, computing ab is technically an algebraic operation. However the common approach to implement powf(float a, float b) is as eb * log a, i.e. using transcendental functions.

There are a few caveats, however. log a is undefined for a < 0, while powf() allows computation with some negative a. Exponentiation, in the form of expf() suffers from error magnification, as I explained in my answer to this question. This requires us to compute log a with higher than single precision for an accurate powf() result. There are various techniques to achieve this, a simple way is to use limited amounts of double-float computation, references for which I provided in my answer to this question. The essence of double-float is that each floating-point operand is represented as a pair of float values called the "head" and the "tail", which satisfy the relation |tail| ? ½ * ulp (|head|) when properly normalized.

The code below shows an exemplary implementation of this approach. It assumes that the IEEE 754-2008 operation FMA (fused multiply-add) is available, which is exposed in C as the standard math functions fma(), fmaf(). It does not provide for handling of errno or floating-point exceptions, but it does provide for the correct handling of all 18 special cases enumerated by the ISO C standard. Tests have been performed with denormal support enabled; the code may or may not work properly within a non-IEEE-754 flush-to-zero (FTZ) environment.

The exponentiation part employs a simple argument reduction based directly on the semi-logarithmic encoding of floating-point numbers, then applies a polynomial minimax approximation on the primary approximation interval. The logarithm computation is based on log(x) = 2 atanh ((x-1) / (x+1)), combined with selective use of double-float computation, to achieve a relative accuracy of 8.3e-10. The computation of b * log a is performed as a double-float operation, the accuracy of the final exponentiation is improved by linear interpolation, by observing that ex is its own derivative, and that therefore ex+y ? ex + y ? ex, when |y| ? |x|.

Double-float computation becomes a bit iffy near overflow boundaries, there are two instances of this in the code. When the head portion of the input to exp is just causing the result to overflow to infinity, the tail portion may be negative, such that the result of powf() is actually finite. One way to address this is to decrease the value of the "head" by one ulp in such a case, an alternative is to compute the head via addition in round-to-zero mode where readily available, since this will ensure like signs for head and tail. The other caveat is that if the exponentiation does overflow, we cannot interpolate the result, as doing so would create a NaN.

It should be noted that the accuracy of the logarithm computation used here is not sufficient to ensure a faithfully-rounded powf() implementation, but it provides a reasonably small error (the maximum error I have found in extensive testing is less than 2 ulps) and it allows the code to be kept reasonably simple for the purpose of demonstrating relevant design principles.

#include <stdint.h> // for uint32_t
#include <string.h> // for memcpy
#include <math.h>   // for frexpf, ldexpf, isinf, nextafterf

#define PORTABLE (1) // 0=bit-manipulation of 'float', 1= math library functions

uint32_t float_as_uint32 (float a) 
{ 
    uint32_t r; 
    memcpy (&r, &a, sizeof r); 
    return r; 
}

float uint32_as_float (uint32_t a) 
{ 
    float r; 
    memcpy (&r, &a, sizeof r); 
    return r; 
}

/* Compute log(a) with extended precision, returned as a double-float value 
   loghi:loglo. Maximum relative error: 8.5626e-10.
*/
void my_logf_ext (float a, float *loghi, float *loglo)
{
    const float LOG2_HI =  6.93147182e-1f; //  0x1.62e430p-1
    const float LOG2_LO = -1.90465421e-9f; // -0x1.05c610p-29
    const float SQRT_HALF = 0.70710678f;
    float m, r, i, s, t, p, qhi, qlo;
    int e;

    /* Reduce argument to m in [sqrt(0.5), sqrt(2.0)] */
#if PORTABLE
    m = frexpf (a, &e);
    if (m < SQRT_HALF) {
        m = m + m;
        e = e - 1;
    }
    i = (float)e;
#else // PORTABLE
    const float POW_TWO_M23 = 1.19209290e-7f; // 0x1.0p-23
    const float POW_TWO_P23 = 8388608.0f; // 0x1.0p+23
    const float FP32_MIN_NORM = 1.175494351e-38f; // 0x1.0p-126
    i = 0.0f;
    /* fix up denormal inputs */
    if (a < FP32_MIN_NORM){
        a = a * POW_TWO_P23;
        i = -23.0f;
    }
    e = (float_as_uint32 (a) - float_as_uint32 (SQRT_HALF)) & 0xff800000;
    m = uint32_as_float (float_as_uint32 (a) - e);
    i = fmaf ((float)e, POW_TWO_M23, i);
#endif // PORTABLE
    /* Compute q = (m-1)/(m+1) as a double-float qhi:qlo */
    p = m + 1.0f;
    m = m - 1.0f;
    r = 1.0f / p;
    qhi = r * m;
    qlo = r * fmaf (qhi, -m, fmaf (qhi, -2.0f, m));
    /* Approximate atanh(q), q in [sqrt(0.5)-1, sqrt(2)-1] */ 
    s = qhi * qhi;
    r =             0.1293334961f;  // 0x1.08c000p-3
    r = fmaf (r, s, 0.1419928074f); // 0x1.22cd9cp-3
    r = fmaf (r, s, 0.2000148296f); // 0x1.99a162p-3
    r = fmaf (r, s, 0.3333332539f); // 0x1.555550p-2
    t = fmaf (qhi, qlo + qlo, fmaf (qhi, qhi, -s)); // s:t = (qhi:qlo)**2
    p = s * qhi;
    t = fmaf (s, qlo, fmaf (t, qhi, fmaf (s, qhi, -p))); // p:t = (qhi:qlo)**3
    s = fmaf (r, p, fmaf (r, t, qlo));
    r = 2 * qhi;
    /* log(a) = 2 * atanh(q) + i * log(2) */
    t = fmaf ( LOG2_HI, i, r);
    p = fmaf (-LOG2_HI, i, t);
    s = fmaf ( LOG2_LO, i, fmaf (2.f, s, r - p));
    *loghi = p = t + s;    // normalize double-float result
    *loglo = (t - p) + s;
}

/* Compute exponential base e. No checking for underflow and overflow. Maximum
   ulp error = 0.86565 
*/
float my_expf_unchecked (float a)
{
    float f, j, r;
    int i;

    // exp(a) = 2**i * exp(f); i = rintf (a / log(2))
    j = fmaf (1.442695f, a, 12582912.f) - 12582912.f; // 0x1.715476p0, 0x1.8p23
    f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1  // log_2_hi 
    f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo 
    i = (int)j;
    // approximate r = exp(f) on interval [-log(2)/2, +log(2)/2]
    r =             1.37805939e-3f;  // 0x1.694000p-10
    r = fmaf (r, f, 8.37312452e-3f); // 0x1.125edcp-7
    r = fmaf (r, f, 4.16695364e-2f); // 0x1.555b5ap-5
    r = fmaf (r, f, 1.66664720e-1f); // 0x1.555450p-3
    r = fmaf (r, f, 4.99999851e-1f); // 0x1.fffff6p-2
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    // exp(a) = 2**i * r
#if PORTABLE
    r = ldexpf (r, i);
#else // PORTABLE
    float s, t;
    uint32_t ia = (i > 0) ? 0u : 0x83000000u;
    s = uint32_as_float (0x7f000000u + ia);
    t = uint32_as_float (((uint32_t)i << 23) - ia);
    r = r * s;
    r = r * t;
#endif // PORTABLE
    return r;
}

/* a**b = exp (b * log (a)), where a > 0, and log(a) is computed with extended 
   precision as a double-float. Maxiumum error found across 2**42 test cases:
   1.97302 ulp @ (0.71162397, -256.672424).
*/
float my_powf_core (float a, float b)
{
    const float MAX_IEEE754_FLT = uint32_as_float (0x7f7fffff);
    const float EXP_OVFL_BOUND = 88.7228394f; // 0x1.62e430p+6f;
    const float EXP_OVFL_UNFL_F = 104.0f;
    const float MY_INF_F = uint32_as_float (0x7f800000);
    float lhi, llo, thi, tlo, phi, plo, r;

    /* compute lhi:llo = log(a) */
    my_logf_ext (a, &lhi, &llo);
    /* compute phi:plo = b * log(a) */
    thi = lhi * b;
    if (fabsf (thi) > EXP_OVFL_UNFL_F) { // definitely overflow / underflow
        r = (thi < 0.0f) ? 0.0f : MY_INF_F;
    } else {
        tlo = fmaf (lhi, b, -thi);
        tlo = fmaf (llo, b, +tlo);
        /* normalize intermediate result thi:tlo, giving final result phi:plo */
#if FAST_FADD_RZ
        phi = __fadd_rz (thi, tlo);// avoid premature ovfl in exp() computation
#else // FAST_FADD_RZ
        phi = thi + tlo;
        if (phi == EXP_OVFL_BOUND){// avoid premature ovfl in exp() computation
#if PORTABLE
            phi = nextafterf (phi, 0.0f);
#else // PORTABLE
            phi = uint32_as_float (float_as_uint32 (phi) - 1);
#endif // PORTABLE
        }
#endif // FAST_FADD_RZ
        plo = (thi - phi) + tlo;
        /* exp'(x) = exp(x); exp(x+y) = exp(x) + exp(x) * y, for |y| << |x| */
        r = my_expf_unchecked (phi);
        /* prevent generation of NaN during interpolation due to r = INF */
        if (fabsf (r) <= MAX_IEEE754_FLT) {
            r = fmaf (plo, r, r);
        }
    }
    return r;
}

float my_powf (float a, float b)
{
    const float MY_INF_F = uint32_as_float (0x7f800000);
    const float MY_NAN_F = uint32_as_float (0xffc00000);
    int expo_odd_int;
    float r;

    /* special case handling per ISO C specification */
    expo_odd_int = fmaf (-2.0f, floorf (0.5f * b), b) == 1.0f;
    if ((a == 1.0f) || (b == 0.0f)) {
        r = 1.0f;
    } else if (isnan (a) || isnan (b)) {
        r = a + b;  // convert SNaN to QNanN or trigger exception
    } else if (isinf (b)) {
        r = ((fabsf (a) < 1.0f) != (b < 0.0f)) ? 0.0f :  MY_INF_F;
        if (a == -1.0f) r = 1.0f;
    } else if (isinf (a)) {
        r = (b < 0.0f) ? 0.0f : MY_INF_F;
        if ((a < 0.0f) && expo_odd_int) r = -r;
    } else if (a == 0.0f) {
        r = (expo_odd_int) ? (a + a) : 0.0f;
        if (b < 0.0f) r = copysignf (MY_INF_F, r);
    } else if ((a < 0.0f) && (b != floorf (b))) {
        r = MY_NAN_F;
    } else {
        r = my_powf_core (fabsf (a), b);
        if ((a < 0.0f) && expo_odd_int) {
            r = -r;
        }
    }
    return r;
}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Flexo
Solution 2