Cycles Per Element vs. actual performance of Polynomial Evaluation

In chapter 5 of the book "Computer Systems: A Programmer's Perspective (3rd edition)", exercises 5.5 and 5.6 discuss polynomial evaluation.

The book also gives two implementations, poly() and polyh(), and says poly() has a CPE (Cycles Per Element) of 5.0 while polyh() has a CPE of 8.0, concluding that poly() runs faster than polyh(). **But with clang-12 or clang-14 on my Ubuntu 20.04 machine, polyh() is much faster, contrary to what these exercises say. I'm confused.**

The Polynomial Evaluation implementations:

// the naive method
double poly(double a[], double x, long degree)
{
    long i;
    double result = a[0];
    double xpwr = x;
    for (i = 1; i <= degree; i++)
    {
        result += a[i] * xpwr;
        xpwr = x * xpwr;
    }
    return result;
}

// the Horner's method
double polyh(double a[], double x, long degree)
{
    long i;
    double result = a[degree];
    for (i = degree-1; i>=0; i--)
    {
        result = a[i] + x * result;
    }

    return result;
}
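Before comparing timings, it may be worth confirming that the two routines return the same value for the same input. The sketch below copies poly() and polyh() from above; the helpers `same_value` and `agree`, and the tolerance of 1e-12, are hypothetical names and choices made for illustration only.

```cpp
#include <cassert>
#include <cmath>

// Copied from the question: the naive method.
double poly(double a[], double x, long degree)
{
    double result = a[0];
    double xpwr = x;
    for (long i = 1; i <= degree; i++)
    {
        result += a[i] * xpwr;
        xpwr = x * xpwr;
    }
    return result;
}

// Copied from the question: Horner's method.
double polyh(double a[], double x, long degree)
{
    double result = a[degree];
    for (long i = degree - 1; i >= 0; i--)
    {
        result = a[i] + x * result;
    }
    return result;
}

// Hypothetical helper: relative comparison, because the two methods round
// differently and are not guaranteed to agree bit for bit.
bool same_value(double p, double q, double rel_tol = 1e-12)
{
    double scale = std::fmax(std::fmax(std::fabs(p), std::fabs(q)), 1.0);
    return std::fabs(p - q) <= rel_tol * scale;
}

// Hypothetical helper: true when both implementations agree on this input.
bool agree(double a[], double x, long degree)
{
    assert(degree >= 0);
    return same_value(poly(a, x, degree), polyh(a, x, degree));
}
```

For example, with coefficients {1, 2, 3} and x = 2, both routines evaluate 1 + 2*2 + 3*4 = 17.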

My compilation flag: -O1. The full implementation (including the timer) is at https://godbolt.org/z/3eW8Wzr7z

My timing results:

   polyh: took   2.318 ms, loop=10, avg =  0.232 ms
    poly: took  78.980 ms, loop=10, avg =  7.898 ms

Why does polyh() run faster despite its larger CPE?

Update: Based on the comments from @Passer By, I used the website quick-bench to measure the time, and the benchmark result differs with the array size:

n = 1000, poly() is faster (https://quick-bench.com/q/EpDmf22VD_E0CvLN0-6TY_Ye8bU)

n = 10000, polyh() is much faster (https://quick-bench.com/q/yuzoVzz_KhWv1gJ-_j9wlZtfWVM)
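Since the ranking changes with n, it may help to recall that the book treats CPE as the slope of run time (in cycles) against the number of elements, not as a single timing. The sketch below estimates that slope from just two measurements. The function name `estimate_cpe` is hypothetical, and the two-point slope is a simplification of the least-squares fit the book uses. Inputs are assumed to already be in cycles, so converting from milliseconds would need the clock frequency.

```cpp
#include <cassert>

// Hypothetical helper: estimate cycles per element as the slope between two
// (element count, cycle count) measurements. The subtraction cancels any
// fixed overhead, such as call and loop setup cost.
double estimate_cpe(long n1, double cycles1, long n2, double cycles2)
{
    assert(n1 != n2);  // two distinct sizes are needed to get a slope
    return (cycles2 - cycles1) / static_cast<double>(n2 - n1);
}
```

With made-up numbers, a run of 1000 elements taking 5100 cycles and a run of 2000 elements taking 10100 cycles would give `estimate_cpe(1000, 5100.0, 2000, 10100.0)`, a slope of 5.0 cycles per element with 100 cycles of fixed overhead.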

c++ performance optimization

Solution 1:^[1]

I think there is some confusion regarding the statements in the book. The compiler output at the link you provided clearly shows polyh() with a shorter loop body than poly():

polyh(double*, double, long):
# skipping non-loop code...
    mulsd   xmm0, xmm1
    addsd   xmm0, qword ptr [rdi + 8*rsi - 16]
    add     rsi, -1
    cmp     rsi, 1
    jg      .LBB1_2

poly(double*, double, long):
# skipping non-loop code...
    movsd   xmm3, qword ptr [rdi + 8*rax + 8]
    mulsd   xmm3, xmm2
    addsd   xmm0, xmm3
    mulsd   xmm2, xmm1
    add     rax, 1
    cmp     rsi, rax
    jne     .LBB0_2

Clearly polyh() compiles to more compact code than poly().

Now let's talk about optimization. First of all, -O0 disables optimization, and -O1 is the lowest optimization level.

But even if you throw optimization out of the window, the code in polyh() is already optimized at the source level, before compilation. Each loop iteration has only one multiplication, one addition, and one assignment, while each iteration of poly() has two multiplications and two assignments.

Clearly polyh() is leaner and faster code.
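The per-iteration operation counts described above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical wrapper type `Counted` whose overloaded operators tally multiplications and additions. The names `Counted`, `OpCounts`, `count_poly`, and `count_polyh` are made up for illustration, and all coefficients are set to 1.0 because only the counts matter. It counts operations only and says nothing about instruction latency or how the operations overlap in the pipeline.

```cpp
#include <cassert>

// Hypothetical wrapper around double used only to count arithmetic.
struct Counted {
    double v;
};

// Global tallies, reset by the count_* helpers below.
static long g_muls = 0;
static long g_adds = 0;

inline Counted operator*(Counted a, Counted b) { ++g_muls; return Counted{a.v * b.v}; }
inline Counted operator+(Counted a, Counted b) { ++g_adds; return Counted{a.v + b.v}; }

struct OpCounts {
    long muls;
    long adds;
};

// Same loop structure as poly(): two multiplications and one addition per iteration.
OpCounts count_poly(long degree)
{
    assert(degree >= 0);
    g_muls = g_adds = 0;
    Counted x{0.5}, coeff{1.0};
    Counted result = coeff;
    Counted xpwr = x;
    for (long i = 1; i <= degree; i++) {
        result = result + coeff * xpwr;
        xpwr = x * xpwr;
    }
    (void)result;  // the value is irrelevant here, only the tallies matter
    return OpCounts{g_muls, g_adds};
}

// Same loop structure as polyh(): one multiplication and one addition per iteration.
OpCounts count_polyh(long degree)
{
    assert(degree >= 0);
    g_muls = g_adds = 0;
    Counted x{0.5}, coeff{1.0};
    Counted result = coeff;
    for (long i = degree - 1; i >= 0; i--) {
        result = coeff + x * result;
    }
    (void)result;
    return OpCounts{g_muls, g_adds};
}
```

For degree 4, `count_poly(4)` reports 8 multiplications and 4 additions, while `count_polyh(4)` reports 4 multiplications and 4 additions.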

UPDATE: After the question was updated, here is what I found. I tested with the same quick-bench setup but used GCC instead of Clang, since GCC is what I use on my computer, and the results are still the same: polyh() wins even with 1000 iterations.

https://quick-bench.com/q/_0IppR0fGBncrR60s5WtUiTq5U8

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
