Loop unrolling not giving expected speedup for floating-point dot product

    /* Inner product. Accumulate in temporary */
    void inner4(vec_ptr u, vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(u);
        data_t *udata = get_vec_start(u);
        data_t *vdata = get_vec_start(v);
        data_t sum = (data_t) 0;

        for (i = 0; i < length; i++) {
            sum = sum + udata[i] * vdata[i];
        }
        *dest = sum;
    }

Write a version of the inner product procedure described in the above problem that uses 6 × 1a loop unrolling. For x86-64, our measurements of the unrolled version give a CPE of 1.07 for integer data but still 3.01 for both floating-point data types.

My code for the 6 × 1a version of loop unrolling:

    void inner4(vec_ptr u, vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(u);
        data_t *udata = get_vec_start(u);
        data_t *vdata = get_vec_start(v);
        long limit = length - 5;
        data_t sum = (data_t) 0;

        /* Combine six elements per iteration. All six products are grouped
           so that only one addition per iteration depends on sum. */
        for (i = 0; i < limit; i += 6) {
            sum = sum +
                  (((udata[i]   * vdata[i]
                   + udata[i+1] * vdata[i+1])
                  + (udata[i+2] * vdata[i+2]
                   + udata[i+3] * vdata[i+3]))
                  + (udata[i+4] * vdata[i+4]
                   + udata[i+5] * vdata[i+5]));
        }

        /* Finish the remaining elements, continuing from i rather than
           restarting at 0 (restarting would add every element twice) */
        for (; i < length; i++) {
            sum = sum + udata[i] * vdata[i];
        }
        *dest = sum;
    }
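
A quick way to check that an unrolled version still computes the same result is to compare it against the simple loop for every short length, including lengths that are not a multiple of 6. The sketch below is only an illustration: it works on plain arrays instead of `vec_ptr`, assumes `data_t` is `long` so the results compare exactly, and `inner_simple`, `inner_unrolled6`, and `check_all_lengths` are hypothetical names, not part of the original code.

```c
#include <assert.h>

typedef long data_t;   /* assumption: integer data, so sums compare exactly */

/* Reference version: one multiply and one add per element. */
data_t inner_simple(const data_t *u, const data_t *v, long length)
{
    data_t sum = 0;
    for (long i = 0; i < length; i++)
        sum = sum + u[i] * v[i];
    return sum;
}

/* 6 x 1a version: six products are combined before being added to sum. */
data_t inner_unrolled6(const data_t *u, const data_t *v, long length)
{
    long i;
    long limit = length - 5;
    data_t sum = 0;

    for (i = 0; i < limit; i += 6) {
        sum = sum + ((u[i]   * v[i]   + u[i+1] * v[i+1])
                   + (u[i+2] * v[i+2] + u[i+3] * v[i+3])
                   + (u[i+4] * v[i+4] + u[i+5] * v[i+5]));
    }
    /* Cleanup loop continues from i; starting again at 0 would
       count the already-processed elements a second time. */
    for (; i < length; i++)
        sum = sum + u[i] * v[i];
    return sum;
}

/* Hypothetical helper: compare both versions for lengths 0..max_len. */
void check_all_lengths(long max_len)
{
    data_t u[64], v[64];
    assert(max_len <= 64);
    for (long k = 0; k < 64; k++) {
        u[k] = k + 1;
        v[k] = 2 * k - 3;
    }
    for (long n = 0; n <= max_len; n++)
        assert(inner_simple(u, v, n) == inner_unrolled6(u, v, n));
}
```

Calling `check_all_lengths(20)` from a test program exercises lengths below 6, exact multiples of 6, and lengths with leftover elements. For floating-point `data_t` the two versions can differ slightly because the additions are performed in a different order, so an exact comparison would not be appropriate there.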

Question: Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.

Any idea how to solve the problem?
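
One possible way to reason about the 1.00 bound is as a throughput limit: each element needs two loads, one multiply, and one add, and a fully pipelined unit can start at most one operation per cycle. The sketch below just does that arithmetic. The unit counts (2 load units, 1 floating-point adder, 2 floating-point multipliers) are assumptions taken from the commonly cited Haswell characterization, and the function names are hypothetical.

```c
#include <assert.h>

/* Minimum cycles per element for one kind of operation, assuming every
   unit is fully pipelined and can start one operation per cycle. */
double throughput_bound(double ops_per_element, double units)
{
    return ops_per_element / units;
}

/* Lower bound on CPE for a scalar floating-point inner product.
   The unit counts are ASSUMED Haswell figures, not measured here. */
double cpe_lower_bound_fp(void)
{
    double load = throughput_bound(2.0, 2.0);  /* udata[i] and vdata[i], 2 load units */
    double mul  = throughput_bound(1.0, 2.0);  /* one multiply, 2 FP multipliers */
    double add  = throughput_bound(1.0, 1.0);  /* one add, 1 FP adder */

    /* The slowest resource decides the bound. */
    double bound = load;
    if (mul > bound) bound = mul;
    if (add > bound) bound = add;
    return bound;
}
```

Under these assumptions `cpe_lower_bound_fp()` returns 1.0: the two loads per element already saturate the two load units, and the single floating-point adder imposes the same limit, so no amount of scalar unrolling can push the CPE below 1.00.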



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
