How to calculate the theoretical inference time of a network on a GPU?

I am trying to estimate how long a GPU would take to run inference on a deep learning network. However, when I test the method, the theoretical and measured computing times turn out to be completely different.

Here is what I am currently doing:

I obtained the network's FLOPs using https://github.com/Lyken17/pytorch-OpCounter as follows:

from thop import profile

# profile() returns multiply-accumulate operations (MACs); 1 MAC = 2 FLOPs
macs, params = profile(model, inputs=(image,))
tera_flop = macs * 2 * 10 ** -12

This gave 0.0184295 TFLOPs. Then I calculated the peak FLOPS of my GPU (an NVIDIA RTX A3000):

4096 CUDA cores * 1560 MHz * 2 FLOPs/cycle * 10^-6 = 12.78 TFLOPS

This gave a theoretical inference time of:

0.0184 TFLOPs / 12.7795 TFLOPS = 0.00144 s
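For reference, the arithmetic above can be reproduced in a few lines (the constants are the numbers from this question; the 2x factor assumes one fused multiply-add per CUDA core per clock cycle):

```python
# Peak throughput estimate for the RTX A3000 (numbers from the question)
model_tflops = 0.0184295          # FLOPs per forward pass, in TFLOPs
cuda_cores = 4096
clock_hz = 1560e6                 # 1560 MHz boost clock

# cores * clock * 2 FLOPs per cycle, converted to TFLOPS
peak_tflops = cuda_cores * clock_hz * 2 / 1e12

# Theoretical best-case inference time in seconds
theoretical_time_s = model_tflops / peak_tflops

print(f"peak: {peak_tflops:.4f} TFLOPS")
print(f"theoretical time: {theoretical_time_s:.5f} s")
```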

Then, I measured the real inference time by applying the following:

import numpy as np
import torch

model.eval()
model.to(device)
image = image.unsqueeze(0).to(device)  # add a batch dimension

start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
reps = 300
timings = np.zeros((reps, 1))

# GPU warm-up so the measured iterations are not skewed by
# one-time CUDA context and kernel initialization costs
with torch.no_grad():
    for _ in range(10):
        _ = model(image)

# Measure performance
with torch.no_grad():
    for rep in range(reps):
        start.record()
        _ = model(image)
        end.record()
        # Wait for GPU to sync
        torch.cuda.synchronize()
        curr_time = start.elapsed_time(end)
        timings[rep] = curr_time

mean_syn = np.mean(timings) * 10 ** -3  # elapsed_time() is in ms; convert to s

This gave a measured inference time of 0.028 s, roughly 20x the theoretical estimate.
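To quantify the mismatch, the ratio of theoretical to measured time (a rough proxy for achieved compute utilization) can be computed directly from the two numbers above:

```python
# Values taken from the calculations in this question
theoretical_s = 0.00144   # theoretical best-case inference time
measured_s = 0.028        # measured mean inference time

# Fraction of the GPU's peak throughput the network actually achieves
utilization = theoretical_s / measured_s

print(f"achieved utilization: {utilization:.1%}")
```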

Could you please help me figure out what I am doing wrong here?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
