'Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)

I'm a newbie at instruction optimization.

I did a simple analysis on a simple function dotp which is used to get the dot product of two float arrays.

The C code is as follows:

float dotp(               
    const float  x[],   
    const float  y[],     
    const short  n      
)
{
    short i;
    float suma;
    suma = 0.0f;

    for(i=0; i<n; i++) 
    {    
        suma += x[i] * y[i];
    } 
    return suma;
}

I use the test frame provided by Agner Fog on the web testp.

The arrays which are used in this case are aligned:

int n = 2048;
float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
char *mem = (char*)_mm_malloc(1<<18,4096);
char *a = mem;
char *b = a+n*sizeof(float);
char *c = b+n*sizeof(float);

float *x = (float*)a;
float *y = (float*)b;
float *z = (float*)c;

Then I call the function dotp, n=2048, repeat=100000:

 for (i = 0; i < repeat; i++)
 {
     sum = dotp(x,y,n);
 }

I compile it with gcc 4.8.3, with the compile option -O3.

I compile this application on a computer which does not support FMA instructions, so you can see there are only SSE instructions.

The assembly code:

.L13:
        movss   xmm1, DWORD PTR [rdi+rax*4]  
        mulss   xmm1, DWORD PTR [rsi+rax*4]   
        add     rax, 1                       
        cmp     cx, ax
        addss   xmm0, xmm1
        jg      .L13

I do some analysis:

          μops-fused  la    0    1    2    3    4    5    6    7    
movss       1          3             0.5  0.5
mulss       1          5   0.5  0.5  0.5  0.5
add         1          1   0.25 0.25               0.25   0.25 
cmp         1          1   0.25 0.25               0.25   0.25
addss       1          3         1              
jg          1          1                                   1                                                   -----------------------------------------------------------------------------
total       6          5    1    2     1     1      0.5   1.5

After running, we get the result:

   Clock  |  Core cyc |  Instruct |   BrTaken | uop p0   | uop p1      
--------------------------------------------------------------------
542177906 |609942404  |1230100389 |205000027  |261069369 |205511063 
--------------------------------------------------------------------  
   2.64   |  2.97     | 6.00      |     1     | 1.27     |  1.00   

   uop p2   |    uop p3   |  uop p4 |    uop p5  |  uop p6    |  uop p7       
-----------------------------------------------------------------------   
 205185258  |  205188997  | 100833  |  245370353 |  313581694 |  844  
-----------------------------------------------------------------------          
    1.00    |   1.00      | 0.00    |   1.19     |  1.52      |  0.00           

The second line is the value read from the Intel registers; the third line is divided by the branch number, "BrTaken".

So we can see, in the loop there are 6 instructions, 7 uops, in agreement with the analysis.

The numbers of uops run in port0 port1 port 5 port6 are similar to what the analysis says. I think maybe the uops scheduler does this, it may try to balance loads on the ports, am I right?

I absolutely do not understand know why there are only about 3 cycles per loop. According to Agner's instruction table, the latency of instruction mulss is 5, and there are dependencies between the loops, so as far as I see it should take at least 5 cycles per loop.

Could anyone shed some insight?

==================================================================

I tried to write an optimized version of this function in nasm, unrolling the loop by a factor of 8 and using the vfmadd231ps instruction:

.L2:
    vmovaps         ymm1, [rdi+rax]             
    vfmadd231ps     ymm0, ymm1, [rsi+rax]       

    vmovaps         ymm2, [rdi+rax+32]          
    vfmadd231ps     ymm3, ymm2, [rsi+rax+32]    

    vmovaps         ymm4, [rdi+rax+64]          
    vfmadd231ps     ymm5, ymm4, [rsi+rax+64]    

    vmovaps         ymm6, [rdi+rax+96]          
    vfmadd231ps     ymm7, ymm6, [rsi+rax+96]   

    vmovaps         ymm8, [rdi+rax+128]         
    vfmadd231ps     ymm9, ymm8, [rsi+rax+128]  

    vmovaps         ymm10, [rdi+rax+160]               
    vfmadd231ps     ymm11, ymm10, [rsi+rax+160] 

    vmovaps         ymm12, [rdi+rax+192]                
    vfmadd231ps     ymm13, ymm12, [rsi+rax+192] 

    vmovaps         ymm14, [rdi+rax+224]                
    vfmadd231ps     ymm15, ymm14, [rsi+rax+224] 
    add             rax, 256                    
    jne             .L2

The result:

  Clock   | Core cyc |  Instruct  |  BrTaken  |  uop p0   |   uop p1  
------------------------------------------------------------------------
 24371315 |  27477805|   59400061 |   3200001 |  14679543 |  11011601  
------------------------------------------------------------------------
    7.62  |     8.59 |  18.56     |     1     | 4.59      |     3.44


   uop p2  | uop p3  |  uop p4  |   uop p5  |   uop p6   |  uop p7  
-------------------------------------------------------------------------
 25960380  |26000252 |  47      |  537      |   3301043  |  10          
------------------------------------------------------------------------------
    8.11   |8.13     |  0.00    |   0.00    |   1.03     |  0.00        

So we can see the L1 data cache reach 2*256bit/8.59, it is very near to the peak 2*256/8, the usage is about 93%, the FMA unit only used 8/8.59, the peak is 2*8/8, the usage is 47%.

So I think I've reached the L1D bottleneck as Peter Cordes expects.

==================================================================

Special thanks to Boann, fix so many grammatical errors in my question.

=================================================================

From Peter's reply, I get it that only "read and written" register would be the dependence, "writer-only" registers would not be the dependence.

So I try to reduce the registers used in loop, and I try to unrolling by 5, if everything is ok, I should meet the same bottleneck, L1D.

.L2:
    vmovaps         ymm0, [rdi+rax]    
    vfmadd231ps     ymm1, ymm0, [rsi+rax]    

    vmovaps         ymm0, [rdi+rax+32]    
    vfmadd231ps     ymm2, ymm0, [rsi+rax+32]   

    vmovaps         ymm0, [rdi+rax+64]    
    vfmadd231ps     ymm3, ymm0, [rsi+rax+64]   

    vmovaps         ymm0, [rdi+rax+96]    
    vfmadd231ps     ymm4, ymm0, [rsi+rax+96]   

    vmovaps         ymm0, [rdi+rax+128]    
    vfmadd231ps     ymm5, ymm0, [rsi+rax+128]   

    add             rax, 160                    ;n = n+32
    jne             .L2 

The result:

    Clock  | Core cyc  | Instruct  |  BrTaken |    uop p0  |   uop p1  
------------------------------------------------------------------------  
  25332590 |  28547345 |  63700051 |  5100001 |   14951738 |  10549694   
------------------------------------------------------------------------
    4.97   |  5.60     | 12.49     |    1     |     2.93   |    2.07    

    uop p2  |uop p3   | uop p4 | uop p5 |uop p6   |  uop p7 
------------------------------------------------------------------------------  
  25900132  |25900132 |   50   |  683   | 5400909 |     9  
-------------------------------------------------------------------------------     
    5.08    |5.08     |  0.00  |  0.00  |1.06     |     0.00    

We can see 5/5.60 = 89.45%, it is a little smaller than urolling by 8, is there something wrong?

=================================================================

I try to unroll loop by 6, 7 and 15, to see the result. I also unroll by 5 and 8 again, to double confirm the result.

The result is as follow, we can see this time the result is much better than before.

Although the result is not stable, the unrolling factor is bigger and the result is better.

            | L1D bandwidth     |  CodeMiss | L1D Miss | L2 Miss 
----------------------------------------------------------------------------
  unroll5   | 91.86% ~ 91.94%   |   3~33    | 272~888  | 17~223
--------------------------------------------------------------------------
  unroll6   | 92.93% ~ 93.00%   |   4~30    | 481~1432 | 26~213
--------------------------------------------------------------------------
  unroll7   | 92.29% ~ 92.65%   |   5~28    | 336~1736 | 14~257
--------------------------------------------------------------------------
  unroll8   | 95.10% ~ 97.68%   |   4~23    | 363~780  | 42~132
--------------------------------------------------------------------------
  unroll15  | 97.95% ~ 98.16%   |   5~28    | 651~1295 | 29~68

=====================================================================

I try to compile the function with gcc 7.1 in the web "https://gcc.godbolt.org"

The compile option is "-O3 -march=haswell -mtune=intel", that is similar to gcc 4.8.3.

.L3:
        vmovss  xmm1, DWORD PTR [rdi+rax]
        vfmadd231ss     xmm0, xmm1, DWORD PTR [rsi+rax]
        add     rax, 4
        cmp     rdx, rax
        jne     .L3
        ret


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source