What's the difference between launching with an API call vs the triple-chevron syntax?

Consider the following two function templates:

template <typename... Params>
void foo(Params... params)
{
    /* etc etc */
    my_kernel<<<grid_dims, block_dims, shmem_size, stream_id>>>(params...);
}

and:

template <typename... Params> 
void bar(Params... params)
{
    /* etc etc */
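    void* arguments_ptrs[sizeof...(Params)];
    auto arg_index = 0;

    // Take each parameter by reference: with a by-value `auto param`, the stored
    // address would point at the lambda's local copy, which dies when the lambda returns.
    for_each_argument(
        [&](auto& param) {arguments_ptrs[arg_index++] = &param;},
        params...);

    cudaLaunchKernel<decltype(my_kernel)>(
        &my_kernel, grid_dims, block_dims, arguments_ptrs, shmem_size, stream_id);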
}

with for_each_argument being as defined by Sean Parent.
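For readers who don't have that helper at hand, here is a minimal host-only sketch of the pointer-collection step in bar. It is an assumption about what such a helper could look like, not necessarily Sean Parent's exact definition: this for_each_argument uses a braced initializer list so the calls happen left to right, and collect_argument_pointers is a hypothetical name introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Assumed formulation of a for_each_argument helper (may differ from the one
// the question refers to). The braced initializer list guarantees that f is
// applied to the arguments in left-to-right order.
template <typename F, typename... Args>
void for_each_argument(F f, Args&&... args)
{
    int unused[] = { 0, (f(std::forward<Args>(args)), 0)... };
    (void) unused;
}

// Hypothetical helper: stores the address of each of the caller's (non-const)
// arguments in out[0], out[1], ... and returns how many were stored.
// The lambda takes `auto&`, so the stored addresses refer to the caller's
// objects rather than to short-lived copies.
template <typename... Params>
std::size_t collect_argument_pointers(void** out, Params&... params)
{
    assert(out != nullptr || sizeof...(Params) == 0);
    std::size_t index = 0;
    for_each_argument(
        [&](auto& param) { out[index++] = &param; },
        params...);
    return index;
}
```

The resulting array has the shape that the args parameter of cudaLaunchKernel expects: one pointer per kernel argument, each of which must stay valid until the launch call returns.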

Questions:

  • Are the semantics of foo and bar exactly identical?
  • Is there some kind of benefit to using one over the other? (e.g. perhaps the first form does heap allocation under the hood or something....)
  • Is it a good idea to use forwarding references in the second function? Both functions?


Solution 1:[1]

Are the semantics of foo and bar exactly identical?

I haven't checked CUDA 9, but in earlier versions, no. The compiler expands the <<<>>> syntax inline into a launch API call plus a call to a generated wrapper function. Interestingly, the kernel launch APIs it uses have long been deprecated. The wrapper function, however, allows the argument types to be checked explicitly at compile time, which is helpful.

[EDIT: I checked CUDA 9.1 and it still uses cudaLaunch as all previous versions of the runtime API did]
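To illustrate why a typed wrapper helps, here is a rough host-only sketch. It is not the code nvcc actually generates: scale_kernel is an ordinary function standing in for a __global__ kernel, and host_side_dispatch and checked_launch are hypothetical names. The idea is only that a wrapper whose parameter list is tied to the kernel's signature rejects mismatched arguments at compile time, before they are type-erased into an array of void pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Ordinary function standing in for a __global__ kernel, so that the sketch
// compiles and runs without nvcc or a GPU.
void scale_kernel(int n, float factor, float* data)
{
    for (int i = 0; i < n; ++i) { data[i] *= factor; }
}

// Hypothetical stand-in for the launch machinery: it receives only an array of
// pointers to the arguments, reads each one back as the kernel's parameter
// type, and calls the function on the host.
template <typename... Ps, std::size_t... Is>
void host_side_dispatch(void (*kernel)(Ps...), void** args, std::index_sequence<Is...>)
{
    assert(args != nullptr);
    kernel(*static_cast<Ps*>(args[Is])...);
}

// Typed wrapper: Ps is deduced from the kernel's signature and from the
// arguments, so a wrong argument count or type is a compile-time error
// instead of silently corrupted kernel parameters.
template <typename... Ps>
void checked_launch(void (*kernel)(Ps...), Ps... args)
{
    // One extra slot keeps the array non-empty for parameterless kernels.
    void* arg_ptrs[sizeof...(Ps) + 1] = { &args... };
    host_side_dispatch(kernel, arg_ptrs, std::index_sequence_for<Ps...>{});
}
```

With this sketch, checked_launch(scale_kernel, 3, 2.0f, data) compiles, while passing 2.0 (a double) or omitting an argument does not. This particular wrapper is stricter than an ordinary function call, since it performs no implicit conversions.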

Is there some kind of benefit to using one over the other? (e.g. perhaps the first form does heap allocation under the hood or something....)

Not that I am aware of.

Is it a good idea to use forwarding references in the second function? Both functions?

If the kernels are compiled in the same compilation unit as the calling code, then no. The toolchain automatically emits forward declarations for kernels.

Solution 2:[2]

Remember that, eventually, the runtime API needs to make driver API calls (assuming it doesn't make secret API calls which we don't know about), so eventually, what's used is cuLaunchKernel():

CUresult cuLaunchKernel ( 
    CUfunction f, 
    unsigned int  gridDimX, 
    unsigned int  gridDimY,
    unsigned int  gridDimZ, 
    unsigned int  blockDimX,
    unsigned int  blockDimY, 
    unsigned int  blockDimZ,
    unsigned int  sharedMemBytes,
    CUstream hStream, 
    void** kernelParams, 
    void** extra ) 

and that's a non-templated interface which doesn't care about kinds-of-references and such.

Of course, there are also two ways to pass the kernel's arguments: through kernelParams and through extra. So if you want to tweak how you go about launching kernels, you might just want to play with that.
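As a rough illustration of the difference: kernelParams takes one pointer per argument, whereas extra can carry a single contiguous buffer holding the argument values. The host-only sketch below builds such a buffer. The names align_up and pack_param_buffer are hypothetical, and placing each value at its natural alignment is an assumption you should check against the driver API documentation for your argument types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Rounds offset up to the next multiple of alignment (a power of two).
inline std::size_t align_up(std::size_t offset, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: copies the argument values, in order, into one
// contiguous byte buffer, placing each value at an offset aligned to its type.
// Arguments are taken by value, so arrays decay to pointers first.
template <typename... Ts>
std::vector<unsigned char> pack_param_buffer(Ts... args)
{
    std::vector<unsigned char> buffer;
    auto append = [&buffer](const auto& arg) {
        using T = std::decay_t<decltype(arg)>;
        static_assert(std::is_trivially_copyable<T>::value,
                      "kernel arguments are copied bytewise");
        const std::size_t offset = align_up(buffer.size(), alignof(T));
        buffer.resize(offset + sizeof(T));
        std::memcpy(buffer.data() + offset, &arg, sizeof(T));
    };
    (append(args), ...);  // C++17 fold expression, evaluated left to right
    return buffer;
}
```

With the driver API, such a buffer would be handed to cuLaunchKernel through extra as an array of the form { CU_LAUNCH_PARAM_BUFFER_POINTER, buffer.data(), CU_LAUNCH_PARAM_BUFFER_SIZE, &size, CU_LAUNCH_PARAM_END }, with kernelParams set to null and size being a size_t holding the buffer length.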

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 einpoklum