Stripping down/modifying Yolov5 for real-time on RPi4
I want to use my Raspberry Pi 4 to detect license plates in real time without any hardware add-ons. I know it doesn't sound very feasible, but hear me out.
I am optimizing the network in two ways: stripping down YOLOv5 (fewer channels and repeats) and running inference with ONNX Runtime, which from what I understand suits the RPi's CPU well. I also quantize the ONNX model. Stripping down the network works well, probably because the task is not very complicated and a relatively simple network is good enough for my purposes. I've reduced the number of weights to 200k and reached about 10 FPS. However, when I reduced the count further to 27k as an experiment, I only gained 4 FPS, up to 14 FPS. I realize that inference time does not scale linearly with the number of weights, but I wonder if there is any way to squeeze more FPS out of it. I don't care that much about precision.
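To see where the remaining time goes (and why 27k weights isn't much faster than 200k), one thing I plan to try is ONNX Runtime's built-in profiler. This is only a minimal sketch, using the quantized model from the step below and random input in place of camera frames:
import numpy as np
import onnxruntime as rt
# Enable the built-in profiler to see which ops dominate the runtime
opts = rt.SessionOptions()
opts.enable_profiling = True
sess = rt.InferenceSession('yolo200k_quant.onnx', sess_options=opts,
                           providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name
dummy = np.random.rand(1, 3, 416, 416).astype(np.float32)  # stand-in for a frame
for _ in range(50):
    sess.run(None, {input_name: dummy})
# Writes a JSON trace that can be opened in chrome://tracing
print(sess.end_profiling())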
Here's the stripped-down model YAML I am using:
# YOLOv5 🚀 by Ultralytics, GPL-3.0 license
# Parameters
nc: 80 # number of classes
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.50 # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23] # P3/8
  - [30,61, 62,45, 59,119] # P4/16
  - [116,90, 156,198, 373,326] # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [16, 6, 2, 2]], # 0-P1/2
   [-1, 1, Conv, [32, 3, 2]], # 1-P2/4
   [-1, 3, C3, [32]],
   [-1, 1, Conv, [64, 3, 2]], # 3-P3/8
   [-1, 3, C3, [64]],
   [-1, 1, Conv, [64, 3, 2]], # 5-P4/16
   [-1, 5, C3, [128]],
   [-1, 1, Conv, [128, 3, 2]], # 7-P5/32
   [-1, 3, C3, [128]],
   [-1, 1, SPPF, [128, 5]], # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [32, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]], # cat backbone P4
   [-1, 3, C3, [32, False]], # 13
   [-1, 1, Conv, [32, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]], # cat backbone P3
   [-1, 3, C3, [64, False]], # 17 (P3/8-small)
   [-1, 1, Conv, [64, 3, 2]],
   [[-1, 14], 1, Concat, [1]], # cat head P4
   [-1, 3, C3, [128, False]], # 20 (P4/16-medium)
   [-1, 1, Conv, [128, 3, 2]],
   [[-1, 10], 1, Concat, [1]], # cat head P5
   [-1, 3, C3, [128, False]], # 23 (P5/32-large)
   [[17, 20, 23], 1, Detect, [nc, anchors]], # Detect(P3, P4, P5)
  ]
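As a sanity check on the 200k/27k weight counts I quote above, I count the parameters of the exported ONNX file with a small helper like this (the file name matches the quantization step below):
import onnx
from onnx import numpy_helper

model = onnx.load('yolo200k.onnx')
# Sum the element counts of all weight tensors stored in the graph
n_params = sum(numpy_helper.to_array(init).size for init in model.graph.initializer)
print(n_params)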
To quantize my model I use:
from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = 'yolo200k.onnx'
model_quant = 'yolo200k_quant.onnx'
# Dynamic quantization: weights stored as uint8, activations quantized on the fly at runtime
quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QUInt8)
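I've also been wondering whether static quantization would be faster than dynamic on the Pi, since the activation scales are computed offline from calibration data instead of at every run. This is only a sketch of what I mean: I'm assuming the exported input tensor is called 'images' (which I believe is the YOLOv5 export default) and using random arrays where real calibration frames should go:
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class DummyCalibrationReader(CalibrationDataReader):
    # Stand-in calibration data; real preprocessed camera frames should go here
    def __init__(self, input_name, n=32):
        self.samples = iter(
            {input_name: np.random.rand(1, 3, 416, 416).astype(np.float32)}
            for _ in range(n)
        )

    def get_next(self):
        return next(self.samples, None)

quantize_static('yolo200k.onnx', 'yolo200k_static_quant.onnx',
                DummyCalibrationReader('images'),  # 'images' = assumed input name
                quant_format=QuantFormat.QDQ,
                weight_type=QuantType.QInt8,
                activation_type=QuantType.QInt8)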
Here's my inference script:
import cv2
import time
import numpy as np
import onnxruntime as rt

# Initialize inference
sess = rt.InferenceSession('yolo200k_quant.onnx', providers=rt.get_available_providers())
input_name = sess.get_inputs()[0].name

cap = cv2.VideoCapture(-1)
print(cap.isOpened())
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

while True:
    start = time.time()
    ret, frame = cap.read()
    if not ret:
        break

    # Crop a 416x416 patch and convert HWC -> NCHW, scaled to [0, 1)
    frame_cropped = frame[0:416, 0:416]
    image = np.expand_dims(frame_cropped, axis=3)
    image = np.transpose(image / 256, (3, 2, 0, 1))

    start_inference = time.time()
    pred_onnx = sess.run(None, {input_name: image.astype('float32')})
    print("Inference time:" + str(time.time() - start_inference))

    output = pred_onnx[0]
    size = output.size
    dimensions = output.shape[2]   # 5 + number of classes
    rows = size / dimensions       # number of candidate boxes
    confidenceIndex = 4
    labelStartIndex = 5

    modelWidth = 416.0
    modelHeight = 416.0
    xGain = modelWidth / image.shape[2]
    yGain = modelHeight / image.shape[3]

    locations = []
    for i in range(int(rows)):
        if output[0][i][confidenceIndex] <= 0.4:
            continue
        # Scale class scores by the objectness score
        for j in range(labelStartIndex, dimensions):
            output[0][i][j] = output[0][i][j] * output[0][i][confidenceIndex]
        for k in range(labelStartIndex, dimensions):
            if output[0][i][k] <= 0.5:
                continue
            location = [0] * 4  # fresh list per detection
            location[0] = (output[0][i][0] - output[0][i][2] / 2) / xGain  # top left x
            location[1] = (output[0][i][1] - output[0][i][3] / 2) / yGain  # top left y
            location[2] = (output[0][i][0] + output[0][i][2] / 2) / xGain  # bottom right x
            location[3] = (output[0][i][1] + output[0][i][3] / 2) / yGain  # bottom right y
            locations.append(location)
            cv2.rectangle(frame_cropped,
                          (int(location[0]), int(location[1])),
                          (int(location[2]), int(location[3])),
                          (10, 255, 0), 2)

    end = time.time()
    fps = 1 / (end - start)
    cv2.imshow('frame', frame_cropped)
    print(fps)

    key = cv2.waitKey(1)
    if key == ord("q"):
        break

cap.release()
I use a very inefficient way of cropping the frame for inference and later finding and drawing the bounding boxes, but I judge the FPS from the inference time alone, which is independent of those steps. The actual FPS is lower and will need further optimization. Right now I want to decrease the inference time itself so that I can theoretically reach real time. Are there any techniques to achieve that other than the ones I've already discussed, or a way to use them more effectively than I do? In particular, I am wondering if there is a way to use ONNX Runtime more efficiently.
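For example, this is the kind of session tuning I've been meaning to benchmark (the thread count and optimization level are guesses to try on the Pi's four cores, not settings I know to be optimal):
import onnxruntime as rt

opts = rt.SessionOptions()
opts.intra_op_num_threads = 4  # one thread per Cortex-A72 core
opts.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL
opts.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = rt.InferenceSession('yolo200k_quant.onnx', sess_options=opts,
                           providers=['CPUExecutionProvider'])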
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow