Why is TensorFlow GPU inference slower than MATLAB GPU inference?

I have trained a DenseNet-based CNN in MATLAB using the Deep Learning Toolbox. It takes an input of shape 1x5000x12. The training data is a structure of size 1x5000x12xn, where n is the number of images. The CNN takes the 1x5000x12xi slice (i being the i-th image of the training data) and makes a binary classification.

I use the following code to evaluate the CNN on test data with the same shape as the training data.

When doing the inference with CPU:

%%
len=64;                              % number of test images
YPred_1=categorical(zeros(len,1));   % predicted labels
scores_1=zeros(len,2);               % class scores
net=net_fv;
%%
tic;
for i=1:len
   [YPred_1(i),scores_1(i,:)]=classify(net,X(:,:,:,i),'ExecutionEnvironment','cpu');
end
toc;

>>Elapsed time is 10.131882 seconds.

When using GPU:

%%
len=64;
YPred_1=categorical(zeros(len,1));
scores_1=zeros(len,2);
net=net_fv;
%%
tic;
for i=1:len
   [YPred_1(i),scores_1(i,:)]=classify(net,X(:,:,:,i),'ExecutionEnvironment','gpu');
end
toc;

>>Elapsed time is 1.853006 seconds.

GPU inference is faster than CPU inference in MATLAB, as expected.

Later, I exported the model to run under TensorFlow on my PC, and the results surprised me:

Code:

import os
os.add_dll_directory("C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2/bin")
# The next two lines hide the GPU so that inference runs on the CPU.
# When they are commented out, the GPU is used instead.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
import tensorflow as tf
import numpy as np
from tensorflow import keras
import time

if __name__ == "__main__":
    labels=np.load(r'D:\tfg\codigo\proyect\final_version\labels_batch_0.npy')
    predictions=np.zeros((64,))
    pathfile=r'D:\tfg\codigo\proyect\final_version\ECG_net.pb'
    model = tf.saved_model.load(pathfile)
    inference = model.signatures["serving_default"]
    raw_input=np.load(r'D:\tfg\codigo\database\test_data\test_dataset_batch_0.npy')
    input=raw_input.astype(np.float32)
    start=time.time()
    for i in range(64):
        tensor=input[np.newaxis,i,:,:,:]
        # result=np.argmax(np.asarray(inference(imageinput=tensor)['softmax']))
        # print(result)
        # predictions[i]=result
        inference(imageinput=tensor)
    end=time.time()
    print("The time of execution of above program is :", end-start)

When doing inference with CPU: The time of execution of above program is : 3.842860221862793

When doing inference with GPU: The time of execution of above program is : 4.381521940231323

Now GPU inference is slower than CPU inference.
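One thing worth ruling out before comparing numbers: on a low-end GPU like the MX150, each per-image call pays fixed overheads (host-to-device copy, kernel launches, and, on the first call, one-time cuDNN setup), which can dominate the small amount of compute per 1x5000x12 input. A minimal timing harness that excludes warm-up and averages over repeats, sketched here with a placeholder `fn` standing in for `inference(imageinput=...)` (the shape below is an assumption about the test batch):

```python
import time
import numpy as np

def timed_per_image(fn, batch, warmup=3, repeats=5):
    """Average time of one full per-image loop over the batch, excluding warm-up."""
    for _ in range(warmup):
        fn(batch[:1])                      # warm-up: pays one-time setup costs
    start = time.time()
    for _ in range(repeats):
        for i in range(batch.shape[0]):
            fn(batch[i:i + 1])             # one image per call, as in the loop above
    return (time.time() - start) / repeats

# Placeholder model call; with the real model use: lambda t: inference(imageinput=t)
dummy = lambda t: t.sum()
elapsed = timed_per_image(dummy, np.zeros((64, 1, 5000, 12), dtype=np.float32))
```

If the gap between CPU and GPU shrinks once warm-up is excluded, the original measurement was dominated by one-time costs rather than steady-state inference speed.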

Finally, running the same code on Google Colab shows the following:

When doing inference with CPU: The time of execution of above program is : 12.010186910629272

When doing inference with GPU: The time of execution of above program is : 1.1063041687011719

Now, in Colab, the GPU is again faster than the CPU.

I honestly do not understand what is happening: the same code behaves as expected in Colab but does something strange on my PC. Moreover, on the same hardware, MATLAB behaves the same way as Google Colab.
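Before comparing timings, it is also worth confirming that this TensorFlow build actually sees the GPU: if the CUDA/cuDNN DLLs fail to load, TensorFlow falls back to the CPU (usually with a warning at import time), and both "CPU" and "GPU" runs end up measuring the same device. A quick check using the standard TF 2.x API:

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)
# An empty list means TF fell back to the CPU;
# check the CUDA/cuDNN load messages printed at import time.
```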

How can I make TensorFlow run faster on the GPU on my PC?
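One concrete thing to try, independent of the answer: amortize per-call overhead by passing the whole batch in a single call instead of looping image by image. This assumes the exported signature accepts a variable batch dimension (an assumption about `ECG_net.pb` that can be checked via `inference.structured_input_signature`). Sketched with NumPy shapes only, using the (64, 1, 5000, 12) layout assumed above:

```python
import numpy as np

inputs = np.zeros((64, 1, 5000, 12), dtype=np.float32)  # assumed test-batch layout

# Per-image loop: 64 calls, each paying Python, dispatch, and transfer overhead.
per_image = [inputs[i:i + 1] for i in range(inputs.shape[0])]

# Single batched call: one transfer and one kernel launch sequence.
# With the real model this would be: inference(imageinput=inputs)
batched = np.concatenate(per_image, axis=0)
```

On a small GPU, a single batched call typically closes most of the gap with MATLAB, because `classify` also benefits from MATLAB's internal batching and kernel caching.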

Data:

OS: Windows 10 home, 64-bit

GPU: NVIDIA GeForce MX150

GPU driver: 512.15

CUDA version: 11.2.0

cuDNN version: 8.1.0.7

TensorFlow version: 2.6.0

Python version: 3.9.12

Matlab version: R2021b

Description of my GPU from MATLAB:

ans = 

  CUDADevice with properties:

                      Name: 'NVIDIA GeForce MX150'
                     Index: 1
         ComputeCapability: '6.1'
            SupportsDouble: 1
             DriverVersion: 11.6000
            ToolkitVersion: 11
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.1474e+09
           AvailableMemory: 1.6376e+09
       MultiprocessorCount: 3
              ClockRateKHz: 1531500
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

Thanks in advance!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
