TensorFlow GPU profiling

I am training a model using the TF Keras API. The issue I am having is that I am unable to maximise the usage of the GPU: it is under-utilised in both memory and processing.

When profiling the model, I can see a lot of operations labelled _Send, which I assume are data hopping between the GPU and the CPU.

Since I am using Keras, I am not placing variables on a device directly, so I am not clear on why this is occurring or how to optimise it.

Another interesting side effect is that larger batches make training slower, with very long waits for the GPU to get data from the CPU.
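For context on the GPU waiting for data, here is a minimal sketch of an input pipeline that prepares batches on the CPU while the GPU is busy. It assumes the data fits in memory as NumPy arrays; the array shapes, the `BATCH_SIZE` value and the variable names are hypothetical placeholders rather than details of the actual model.

```python
import numpy as np
import tensorflow as tf

# Hypothetical in-memory data standing in for the real training set.
features = np.random.rand(1024, 32).astype("float32")
labels = np.random.randint(0, 2, size=(1024, 1)).astype("float32")

BATCH_SIZE = 128  # assumed value, tune for the real workload

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1024)
    .batch(BATCH_SIZE)
    # Let tf.data prepare upcoming batches while the current one is consumed.
    .prefetch(tf.data.AUTOTUNE)
)

# Pull one batch to confirm the shapes the model would receive.
first_features, first_labels = next(iter(dataset))
print(first_features.shape, first_labels.shape)
```

A dataset built this way can be passed directly to `model.fit(dataset, ...)`; whether it removes the waits depends on where the time is really being spent, which the profiler's input-pipeline view should show.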

The profiler also suggests:

59.4 % of the total step time sampled is spent on 'Kernel Launch'. It could be due to CPU contention with tf.data. In this case, you may try to set the environment variable TF_GPU_THREAD_MODE=gpu_private.

I have set this environment variable at the top of the notebook, with no visible effect, and I am not clear on how to check whether it is actually being applied.
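A minimal sketch of one way to set the variable from inside a notebook and sanity-check it. It assumes TensorFlow reads TF_GPU_THREAD_MODE when it initialises the GPU, so the safest place to set it is before TensorFlow is imported in that kernel; the variable name `tf_already_imported` is just illustrative.

```python
import os
import sys

# Record whether TensorFlow was already imported in this process (for example
# by an earlier notebook cell); if so, a kernel restart may be needed.
tf_already_imported = "tensorflow" in sys.modules

# Set the variable suggested by the profiler for this process.
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"

print("TF_GPU_THREAD_MODE =", os.environ.get("TF_GPU_THREAD_MODE"))
print("TensorFlow imported before the variable was set:", tf_already_imported)

# import tensorflow as tf  # import only after the variable is set
```

This only confirms that the variable is visible to the process and was set early enough. Whether it changes behaviour is best judged by re-running the profiler and comparing the 'Kernel Launch' share of step time.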

Your help here would be greatly appreciated. I have read all the available guides in the TensorFlow docs.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
