PyTorch model prediction on GPU or CPU: speed improvement

Running a multi-layer perceptron (MLP) model on the CPU is faster than running it on the GPU.

device = torch.device("cuda")
MODEL = MLP(num_classes=len(MODEL_META["labels"])).to(device)
checkpoint = torch.load(path, map_location=device)
MODEL.load_state_dict(checkpoint)
MODEL.eval()  # disable dropout for inference

I run inference with the following loop:

for i in data:
    v = data[i:256]
    v = v[0:1600]
    v = np.pad(v, (0, 1600 - 256), 'constant')
    x = torch.from_numpy(v).float().view(-1, 1600).to(device=device)
    with torch.no_grad():
        out = MODEL(x)

On the same data, the GPU finishes this loop in about 3.18 seconds while the CPU finishes in about 2.54 seconds.

If I instead load a convolutional neural network trained on the same data, the GPU executes in about 4.28 seconds and the CPU in about 8.11 seconds.
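One thing worth checking before comparing numbers: CUDA kernels launch asynchronously, so timing a GPU loop with plain wall-clock calls can understate or misattribute the work unless you synchronize before stopping the clock. A minimal timing sketch, with a stand-in model of the same input width as the MLP above (not the original code):

```python
import time

import torch
import torch.nn as nn

# Stand-in for the question's MLP: same 1600-wide input.
model = nn.Sequential(nn.Linear(1600, 512), nn.ReLU(), nn.Linear(512, 6))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

x = torch.randn(1, 1600, device=device)

with torch.no_grad():
    start = time.time()
    for _ in range(100):
        out = model(x)
    if device.type == "cuda":
        # Wait for all queued kernels before reading the clock,
        # otherwise the measured time excludes in-flight GPU work.
        torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"{device.type}: {elapsed:.4f} s for 100 single-sample passes")
```

Single-sample passes like this are dominated by per-call launch and transfer overhead on the GPU, which is one plausible reason a small MLP runs faster on the CPU.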

Using multiprocessing I can split the work when running models on the CPU, like this:

jobs = []
for i in range(threads):
    p = multiprocessing.Process(target=my_search_function, args=params)
    jobs.append(p)
    p.start()

for proc in jobs:
    proc.join()

Just by using 2 CPU cores I get nearly GPU-level performance.
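For completeness, one way to divide the data among the worker processes is to give each one its own slice; `count_chunk` below is a placeholder for the real per-process inference work, not the original `my_search_function`:

```python
import multiprocessing

import numpy as np


def count_chunk(chunk):
    # Placeholder worker: the real code would run model inference here.
    return len(chunk)


data = np.arange(1000)
threads = 2
chunks = np.array_split(data, threads)  # one roughly equal slice per worker

with multiprocessing.Pool(threads) as pool:
    results = pool.map(count_chunk, chunks)

total = sum(results)  # every sample handled exactly once
```

`multiprocessing.Pool` handles the start/join bookkeeping that the explicit `Process` loop above does by hand, and returns the per-worker results in order.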

Setup: virtual machine (Proxmox) with a 12-core Ryzen 9 3900X, a GTX 1060 6 GB (PCI passthrough), 8 GB RAM, Ubuntu 20.04.4 LTS.

Am I doing something wrong, or is this the expected behaviour? Any tips to improve performance?



Solution 1:[1]

@Zoom this is my NN model:

class MLP(nn.Module):
    def __init__(self, num_classes=6):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(1600, 128)
        self.fc2 = nn.Linear(128, 256)
        self.fc3 = nn.Linear(256, 512)
        self.fc4 = nn.Linear(512, 512)  # defined but not used in forward
        self.fc5 = nn.Linear(512, num_classes)
        self.dropout = nn.Dropout(0.2)
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()
        self.relu3 = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)  # defined but not used in forward

    def forward(self, x):
        # flatten input to (batch, 1600)
        x = x.view(-1, 1600)
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        x = self.relu3(x)
        x = self.dropout(x)
        x = self.fc5(x)
        return x

and I use it like this:

device = torch.device("cuda")
MODEL = MLP(num_classes=len(MODEL_META["labels"])).to(device)
checkpoint = torch.load(path, map_location=device)
MODEL.load_state_dict(checkpoint)

for i in data:
    v = data[i:256]
    v = v[0:1600]
    v = np.pad(v, (0, 1600 - 256), 'constant')
    x = torch.from_numpy(v).float().view(-1, 1600).to(device=device)
    with torch.no_grad():
        out = MODEL(x)

By my understanding of batching, I can pass a list of "x" tensors and get back a list of predictions, like this:

L = []
for i in data:
    v = data[i:256]
    v = v[0:1600]
    v = np.pad(v, (0, 1600 - 256), 'constant')
    x = torch.from_numpy(v).float().view(-1, 1600).to(device=device)
    L.append(x)
with torch.no_grad():
    out = MODEL(L)

where "out" will be a list of tensors of the same length as the input list?
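Not quite: an `nn.Module` expects a single tensor, not a Python list, so the usual approach is to concatenate the per-sample tensors into one batch; row `i` of the output then corresponds to entry `i` of the list. A minimal sketch with a stand-in linear model (not the original MODEL):

```python
import torch
import torch.nn as nn

model = nn.Linear(1600, 6)  # stand-in for MODEL

# List of single-sample tensors, each of shape (1, 1600), as in the loop above.
L = [torch.randn(1, 1600) for _ in range(10)]

batch = torch.cat(L, dim=0)  # shape (10, 1600): one row per sample
with torch.no_grad():
    out = model(batch)       # shape (10, 6): out[i] is the prediction for L[i]
```

Running one batched forward pass instead of ten single-sample passes also amortizes the per-call overhead, which matters most on the GPU.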

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Adiz