Does clFlush (as opposed to clFinish) actually do anything?
The OpenCL clFinish() API call blocks until all commands on a command queue have completed execution. A related function, clFlush(), supposedly:
"Issues all previously queued OpenCL commands in a command-queue to the device associated with the command-queue."
What does that mean? Does it make these commands skip waiting on events? That doesn't sound reasonable. Does it block until the commands have been issued? Probably not, that's what clFinish() does. It almost seems as though clFlush() doesn't actually have to do anything.
What am I missing?
Solution 1:[1]
When you enqueue asynchronous commands with OpenCL, there is no guarantee that they have been submitted to the device yet, so the GPU may not have started executing them.
These asynchronous commands are usually memory transfers (clEnqueueWriteBuffer, clEnqueueReadBuffer) with the blocking flag set to CL_FALSE and kernel invocations (clEnqueueNDRangeKernel).
If you wish to guarantee the commands will execute, you have to either enqueue a blocking command such as an API call with the blocking flag set to CL_TRUE or call clFinish / clFlush.
clFlush basically transfers the recorded commands to the GPU. The commands are 'flushed' to the hardware command buffer and are executed once the GPU scheduler schedules them for execution.
An important corner case is when the recorded command (or a previously recorded command on the same queue) has to synchronize with a user event (clCreateUserEvent) or an event originating from a command recorded to a different queue. These events are supplied via the OpenCL API and can stall flushed commands until the event is triggered.
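To make the difference between "recorded", "flushed" and "finished" concrete, here is a minimal toy model in plain C. It is not OpenCL code: toy_queue, toy_enqueue, toy_flush, toy_device_step and toy_finish are hypothetical names made up for this sketch, and the three states only loosely mirror the CL_QUEUED, CL_SUBMITTED and CL_COMPLETE event statuses. It assumes a device that runs one command per scheduling step and it ignores event dependencies entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name here is hypothetical, none of it is OpenCL API. */
typedef enum { CMD_QUEUED, CMD_SUBMITTED, CMD_COMPLETE } cmd_state;

#define TOY_MAX_CMDS 8

typedef struct {
    cmd_state state[TOY_MAX_CMDS]; /* state of each recorded command */
    size_t count;                  /* number of commands recorded so far */
} toy_queue;

/* Record a command on the host side; nothing reaches the device yet.
   Returns the command index, or -1 if the toy queue is full. */
int toy_enqueue(toy_queue *q) {
    if (q->count == TOY_MAX_CMDS) return -1;
    q->state[q->count] = CMD_QUEUED;
    return (int)q->count++;
}

/* Hand every recorded command to the device and return immediately.
   Nothing is executed here; the commands only become eligible to run. */
void toy_flush(toy_queue *q) {
    for (size_t i = 0; i < q->count; ++i)
        if (q->state[i] == CMD_QUEUED) q->state[i] = CMD_SUBMITTED;
}

/* One scheduling step of the device: complete the oldest submitted command.
   Returns false when the device has nothing to run. */
bool toy_device_step(toy_queue *q) {
    for (size_t i = 0; i < q->count; ++i) {
        if (q->state[i] == CMD_SUBMITTED) {
            q->state[i] = CMD_COMPLETE;
            return true;
        }
    }
    return false;
}

/* Flush, then block until the device has nothing left to run. */
void toy_finish(toy_queue *q) {
    toy_flush(q);
    while (toy_device_step(q)) { /* wait */ }
}
```

In this model, stepping the device after two toy_enqueue calls does nothing, because the commands were only recorded. After toy_flush both commands are submitted but not necessarily complete, and toy_finish returns only once both are complete.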
Why is clFlush necessary?
The logic behind this is that it is most efficient to saturate the GPU with the maximum amount of work (fill up a large command buffer via many enqueue calls) and then tell the GPU to execute everything using a single asynchronous call to clFlush or its blocking variant clFinish.
One reason you might want to call clFlush over clFinish is if you wish to interleave CPU work with GPU work:
clEnqueue*** // async
clEnqueue*** // async
clEnqueue*** // async
clFlush(...); // async, make sure commands will execute
// do some heavy CPU work while GPU is executing commands
clFinish(...); // synchronous, ensure all commands are done, collect results.
Some OpenCL implementations (AMD for instance) will batch commands recorded in between clFlush calls, meaning they treat the recorded commands as a single command from the point of view of event synchronization.
event1 = clEnqueue*** // async
event2 = clEnqueue*** // async
event3 = clEnqueue*** // async
clFlush(...); // async, previously recorded commands will execute
event4 = clEnqueue*** // async
event5 = clEnqueue*** // async
event6 = clEnqueue*** // async
clFlush(...); // async, previously recorded commands will execute
clWaitForEvents(1, &event3); // wait for kernels 1-3
// Do CPU work while the GPU processes kernels 4-6
clWaitForEvents(1, &event6); // wait for kernels 4-6
In this case, events 1-3 will be signaled after the 3rd kernel finishes execution, while events 4-6 will be signaled after the 6th kernel finishes execution.
This allows you to post-process results on the CPU side while the GPU is still working.
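The batch-granular signaling described above can be sketched with another small toy model in plain C. Again, this is not OpenCL code and not how any particular driver is implemented: batch_queue and the batch_* functions are hypothetical names, and the model simply assumes that an event is signaled only once the last command of the batch it was flushed in has finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not OpenCL API or driver internals. */
#define BATCH_MAX_CMDS 16

typedef struct {
    int batch_of[BATCH_MAX_CMDS]; /* id of the batch each command belongs to */
    size_t count;                 /* commands recorded so far */
    size_t flushed;               /* commands handed to the device so far */
    size_t executed;              /* commands the device has finished, in order */
    int open_batch;               /* id of the batch currently being recorded */
} batch_queue;

/* Record a command in the currently open batch; returns its index or -1. */
int batch_enqueue(batch_queue *q) {
    if (q->count == BATCH_MAX_CMDS) return -1;
    q->batch_of[q->count] = q->open_batch;
    return (int)q->count++;
}

/* Close the open batch and hand all recorded commands to the device. */
void batch_flush(batch_queue *q) {
    q->flushed = q->count;
    q->open_batch++;
}

/* The device finishes up to n more commands, but only flushed ones. */
void batch_device_run(batch_queue *q, size_t n) {
    q->executed += n;
    if (q->executed > q->flushed) q->executed = q->flushed;
}

/* An event is signaled once the LAST command of its batch has finished. */
bool batch_event_signaled(const batch_queue *q, int cmd) {
    if (cmd < 0 || (size_t)cmd >= q->count) return false;
    size_t last = (size_t)cmd;
    while (last + 1 < q->count && q->batch_of[last + 1] == q->batch_of[cmd])
        ++last;
    return last < q->executed;
}
```

With six commands flushed in two batches of three, the event of command 1 stays unsignaled while only two commands have finished, all of events 1-3 become signaled after the third, and events 4-6 only after the sixth.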
Another use case for clFlush is to reduce latency caused by driver work and GPU scheduling. If your enqueued commands must wait for an external event (barrier), you could flush them to the GPU beforehand. Once the event is triggered, the commands are already flushed to the hardware side and waiting for actual execution, which saves the driver latency involved in transferring the recorded commands to the GPU.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
