How to profile how many instructions are executed during a CUDA kernel launch

I want to know how many FP32 and INT32 instructions are executed in a CUDA kernel during a launch. Is there any way to profile it via Nvidia Nsight Compute?

Solution 1: [1]

"Is there any way to profile it via Nvidia Nsight Compute?"

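Yes. For Nsight Compute, the relevant metrics are listed below. Both count SASS instructions at thread level (once per thread that executes the instruction with its predicate on):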

fp32 instructions executed:    smsp__sass_thread_inst_executed_op_fp32_pred_on.sum
integer instructions executed: smsp__sass_thread_inst_executed_op_integer_pred_on.sum
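
Both counters can be collected in one run, because `--metrics` accepts a comma-separated list. Below is a minimal Python sketch that assembles such a command line. The helper name `build_ncu_command` is hypothetical and `./t2003` is only a placeholder executable name. The sketch builds and prints the command but does not launch the profiler.

```python
import shlex

# Metric names taken from the answer above.
FP32_METRIC = "smsp__sass_thread_inst_executed_op_fp32_pred_on.sum"
INT_METRIC = "smsp__sass_thread_inst_executed_op_integer_pred_on.sum"


def build_ncu_command(app, metrics):
    """Return the argv list for one ncu run that collects all given metrics.

    Hypothetical helper for illustration; it is not part of Nsight Compute.
    """
    # ncu takes several metrics as a single comma-separated --metrics argument.
    return ["ncu", "--metrics", ",".join(metrics), app]


command = build_ncu_command("./t2003", [FP32_METRIC, INT_METRIC])
# Print a shell-ready command line that can be pasted into a terminal.
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` on a machine where `ncu` is installed.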

Example:

$ ncu --metrics smsp__sass_thread_inst_executed_op_fp32_pred_on.sum ./t2003
...
==PROF== Disconnected from process 27520
[27520] t2003@127.0.0.1
  kernel_1(const float *, float *), 2022-Apr-23 23:24:34, Context 1, Stream 16
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    smsp__sass_thread_inst_executed_op_fp32_pred_on.sum                               inst                         10,240
    ---------------------------------------------------------------------- --------------- ------------------------------

  kernel_2(const float *, float *), 2022-Apr-23 23:24:34, Context 1, Stream 17
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    smsp__sass_thread_inst_executed_op_fp32_pred_on.sum                               inst                         10,240
    ---------------------------------------------------------------------- --------------- ------------------------------

$

Note that recent versions of Nsight Compute are intended to be used on Volta and newer (compute capability 7.0 and higher) GPUs only.
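
The example output above reports one value per kernel. To get a total for the whole run, the metric rows can be summed with a short script. The Python sketch below assumes the text layout shown in the example (metric name, unit, value with thousands separators); other Nsight Compute versions may format rows differently. The function name `parse_ncu_metrics` is hypothetical.

```python
import re

# A metric row looks like: "<metric name>   <unit>   <value with commas>".
# Assumption: this matches the text layout shown in the example output.
ROW = re.compile(r"^\s*([\w.]+)\s+(\S+)\s+([\d,]+)\s*$")

SAMPLE_OUTPUT = """\
  kernel_1(const float *, float *), 2022-Apr-23 23:24:34, Context 1, Stream 16
    Section: Command line profiler metrics
    ------------------------------------------------- ----- -------
    smsp__sass_thread_inst_executed_op_fp32_pred_on.sum  inst  10,240
    ------------------------------------------------- ----- -------

  kernel_2(const float *, float *), 2022-Apr-23 23:24:34, Context 1, Stream 17
    Section: Command line profiler metrics
    ------------------------------------------------- ----- -------
    smsp__sass_thread_inst_executed_op_fp32_pred_on.sum  inst  10,240
    ------------------------------------------------- ----- -------
"""


def parse_ncu_metrics(text):
    """Sum each metric over all kernels found in ncu text output."""
    totals = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, separator, or blank line
        name, _unit, value = match.groups()
        # Strip thousands separators before converting to int.
        totals[name] = totals.get(name, 0) + int(value.replace(",", ""))
    return totals


print(parse_ncu_metrics(SAMPLE_OUTPUT))
# prints {'smsp__sass_thread_inst_executed_op_fp32_pred_on.sum': 20480}
```

For anything beyond a quick check, the `--csv` output option of `ncu` is probably a sturdier input than the human-readable table.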

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Robert Crovella
