'How to correctly measure the running time of an opencl program

When I used opencl to test the performance of the program, I encountered some doubts. The original intention of our experiment was to test the calculation time of the program, but there were some problems in the actual process. In the following code, timex uses the method of binding events to calculate the results. However, in the actual results, we found that (end_time - start_time) < time1+time2+time3 . We don't know why, and we're also curious how exactly we should calculate the running time of the task.

start_time = clock()
compute(){
    writeData();  // time1
    clEnqueueNDRangeKernel(); // time2
    readData(); //time3
    do_other();
}
end_time = clock()


Solution 1:[1]

It depends on what exactly you want to measure. If it is just for profiling, you could be interested in time1, time2 and time3 separately. If it is for measuring the performance of the compute(), you want to measure its entire runtime.

To measure time with nanosecond precision, use this clock:

#include <chrono>
class Clock {
private:
    typedef chrono::high_resolution_clock clock;
    chrono::time_point<clock> t;
public:
    Clock() { start(); }
    void start() { t = clock::now(); }
    double stop() const { return chrono::duration_cast<chrono::duration<double>>(clock::now()-t).count(); }
};

In your example, the issue with (end_time-start_time) < time1+time2+time3 may have two possible reasons:

  1. Your clock implementation is not accurate enough and it is just rounding error.
  2. You didn't use clFinish(queue);. The OpenCL commands are only enqueued, but not executed right away. So clEnqueueNDRangeKernel(); for example immediately returns, and if you measure time before and time after, you get practically 0. To wait until the Kernel really has been executed, you need to call clFinish afterards.

Example:

Measure time1, time2 and time3 separately:

compute() {
    //clFinish(queue); // if there could still be stuff in the queue, finish it first
    Clock clock;
    clock.start();
    writeData(); // time1
    clFinish(queue);
    const double time1 = clock.stop(); // in seconds
    clock.start();
    clEnqueueNDRangeKernel(); // time2
    clFinish(queue);
    const double time2 = clock.stop(); // in seconds
    clock.start();
    readData(); //time3
    clFinish(queue);
    const double time3 = clock.stop(); // in seconds
    do_other();
    clFinish(queue); // don't forget to finish the queue!
}

To measure the entire execution time of compute(), you only need one clFinish at the end.

compute() {
    //clFinish(queue); // if there could still be stuff in the queue, finish it first
    Clock clock;
    clock.start();
    writeData(); // time1
    clEnqueueNDRangeKernel(); // time2
    readData(); //time3
    do_other();
    clFinish(queue); // don't forget to finish the queue!
    const double time = clock.stop(); // in seconds
}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ProjectPhysX