Why is cupyx.scipy.ndimage.median_filter() slower than opencv-python's CPU cv2.medianBlur()?

I recently came across a case where cupyx.scipy.ndimage.median_filter() (with CUDA 10 or CUDA 11) runs slower than the CPU version in opencv-python, cv2.medianBlur().

Here is a simple code snippet to verify this.

import glog
import time
import cv2
from skimage import data
import numpy as np
import cupy as cp
from cupyx.scipy import ndimage

glog.info("Imgload start.")
img = data.camera()
img = np.concatenate([img for _ in range(10)], axis=0)
img = np.concatenate([img for _ in range(10)], axis=1)
glog.info(f"Imgload end. shape is {img.shape}, dtype is {img.dtype}")

glog.info("Cupy MemcpyH2D start")
t0 = time.time()
img_gpu = cp.asarray(img)
glog.info("Cupy MemcpyH2D end")
glog.info("cucim start")
med_gpu = ndimage.median_filter(img_gpu, 51)  # kernel launch returns asynchronously
glog.info("cucim end")
med_cpu = med_gpu.get()  # blocks until the kernel finishes
glog.info(f"Cupy MemcpyD2H end. Time cost: {time.time() - t0:.4f}s")

glog.info("cv2 start")
t1 = time.time()
med2 = cv2.medianBlur(img, 51)
glog.info(f"cv2 end. Time cost: {time.time() - t1:.4f}s")

Here are the results:

root# python3 exp_git.py
I0209 09:35:53.929506 158319 exp_git.py:10] Imgload start.
I0209 09:35:54.444368 158319 exp_git.py:14] Imgload end. shape is (5120, 5120), dtype is uint8
I0209 09:35:54.444495 158319 exp_git.py:16] Cupy MemcpyH2D start
I0209 09:35:56.274070 158319 exp_git.py:19] Cupy MemcpyH2D end
I0209 09:35:56.274307 158319 exp_git.py:20] cucim start
I0209 09:35:56.462010 158319 exp_git.py:24] cucim end
I0209 09:36:04.008166 158319 exp_git.py:26] Cupy MemcpyD2H end. Time cost: 9.5632s
I0209 09:36:04.008691 158319 exp_git.py:28] cv2 start
I0209 09:36:04.676652 158319 exp_git.py:31] cv2 end. Time cost: 0.6679s
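One caveat with these wall-clock numbers: CuPy kernel launches return immediately, so the "cucim start"/"cucim end" pair above mostly measures launch overhead, and the kernel cost surfaces in the next blocking call. Below is a minimal sketch of timing with explicit device synchronization; `time_median_filter_synced` is a hypothetical helper (it assumes a CUDA-capable machine and returns a notice string otherwise):

```python
import time

def time_median_filter_synced(shape=(5120, 5120), size=51):
    """Time cupyx's median_filter with explicit synchronization.

    A bare time.time() pair around an asynchronous CuPy call measures
    launch overhead, not kernel time; synchronizing the device before
    reading the clock attributes the kernel cost to the right interval.
    Returns a notice string if no usable CUDA device is present.
    """
    try:
        import cupy as cp
        from cupyx.scipy import ndimage
        img_gpu = cp.random.randint(0, 256, shape, dtype=cp.uint8)
        cp.cuda.Device().synchronize()   # drain any pending work first
        t0 = time.time()
        med = ndimage.median_filter(img_gpu, size)
        cp.cuda.Device().synchronize()   # wait for the kernel to finish
        return f"kernel time: {time.time() - t0:.4f}s"
    except Exception as exc:             # no CuPy / no GPU available
        return f"GPU timing unavailable: {exc}"

print(time_median_filter_synced())
```

cupyx.profiler.benchmark performs this warm-up and synchronization for you, which is why it is the recommended tool when it works.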

I tried cupyx.profiler.benchmark, but I couldn't get it working, so I used nvprof instead to double-check the GPU time. Here is what I got:

root# nvprof --print-gpu-trace python3 exp_git.py
I0209 09:37:18.500330 159091 exp_git.py:10] Imgload start.
I0209 09:37:19.008963 159091 exp_git.py:14] Imgload end. shape is (5120, 5120), dtype is uint8
I0209 09:37:19.009088 159091 exp_git.py:16] Cupy MemcpyH2D start
==159091== NVPROF is profiling process 159091, command: python3 exp_git.py
I0209 09:37:19.649279 159091 exp_git.py:19] Cupy MemcpyH2D end
I0209 09:37:19.649417 159091 exp_git.py:20] cucim start
I0209 09:37:19.837142 159091 exp_git.py:24] cucim end
I0209 09:37:27.368844 159091 exp_git.py:26] Cupy MemcpyD2H end. Time cost: 8.3596s
I0209 09:37:27.369069 159091 exp_git.py:28] cv2 start
I0209 09:37:28.036207 159091 exp_git.py:31] cv2 end. Time cost: 0.6670s
==159091== Profiling application: python3 exp_git.py
==159091== Profiling result:
Start Duration Grid Size Block Size Regs SSMem DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
632.76ms 2.1469ms - - - - - 25.000MB 11.372GB/s Pinned Device Tesla V100S-PCI 1 7 [CUDA memcpy HtoD]
727.64ms 4.5430us (21 1 1) (128 1 1) 16 0B 0B - - - - Tesla V100S-PCI 1 7 cupy_fill [254]
728.45ms 5.9840us (1 1 1) (512 1 1) 28 4.0000KB 0B - - - - Tesla V100S-PCI 1 7 cupy_sum [268]
728.53ms 2.1760us - - - - - 8B 3.5062MB/s Device Pageable Tesla V100S-PCI 1 7 [CUDA memcpy DtoH]
728.92ms 30.399us - - - - - 25.000MB 803.12GB/s Device - Tesla V100S-PCI 1 7 [CUDA memset]
820.67ms 7.52002s (204800 1 1) (128 1 1) 80 0B 0B - - - - Tesla V100S-PCI 1 7 cupyx_scipy_ndimage_rank_2601_1300_2d_reflect_w51_51 [290]
8.34069s 11.069ms - - - - - 25.000MB 2.2057GB/s Device Pageable Tesla V100S-PCI 1 7 [CUDA memcpy DtoH]

It looks to me like the generated elementwise kernel cupyx_scipy_ndimage_rank_2601_1300_2d_reflect_w51_51 accounts for almost all of the time (about 7.5 s; the name suggests it selects rank 1300 out of 51 × 51 = 2601 window elements per output pixel). Is this reasonable? I also noticed that when I increase the image size (the original input is a large TIFF file), cupyx falls even further behind.
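Part of the gap is algorithmic: a generic rank filter ranks all window elements per pixel, whereas 8-bit median filters such as OpenCV's can use histogram-based methods (Huang's sliding histogram and its constant-time refinements), whose per-pixel cost barely grows with window size. The sketch below is a hypothetical 1-D illustration of the histogram trick, not OpenCV's actual implementation:

```python
import numpy as np

def sliding_median_uint8(x, w):
    """1-D sliding median over an odd window w for uint8 data, Huang-style.

    A 256-bin histogram is updated incrementally (one value enters, one
    leaves per step), so each median costs at most a 256-bin scan no
    matter how large w is. A generic rank filter instead ranks all w
    values for every output sample.
    """
    assert x.dtype == np.uint8 and w % 2 == 1
    n = len(x)
    out = np.empty(n - w + 1, dtype=np.uint8)
    hist = np.zeros(256, dtype=np.int64)
    for v in x[:w]:                  # build the histogram of the first window
        hist[v] += 1
    mid = w // 2 + 1                 # 1-based rank of the median
    for i in range(len(out)):
        # walk the histogram until the cumulative count reaches the median rank
        cum = 0
        for b in range(256):
            cum += hist[b]
            if cum >= mid:
                out[i] = b
                break
        if i + 1 < len(out):
            hist[x[i]] -= 1          # value leaving the window
            hist[x[i + w]] += 1      # value entering the window
    return out

rng = np.random.default_rng(0)
x = rng.integers(0, 256, 1000, dtype=np.uint8)
ref = np.array([np.median(x[i:i + 51]) for i in range(len(x) - 50)],
               dtype=np.uint8)
assert np.array_equal(sliding_median_uint8(x, 51), ref)
```

With a 51-pixel window the rank-based approach touches 51 values per output, while the histogram approach touches 2 values plus at most a fixed 256-bin scan; in 2-D the gap widens to 2601 values versus a per-column histogram update, which is consistent with the kernel time observed above.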

My environment:

root# pip3 list
Package            Version
------------------ ---------------
click              8.0.3
cucim              22.2.0
cupy-cuda114       10.1.0
cycler             0.11.0
fastrlock          0.8
fonttools          4.29.1
glog               0.3.1
imageio            2.15.0
importlib-metadata 4.10.1
kiwisolver         1.3.2
matplotlib         3.5.1
networkx           2.6.3
numpy              1.21.5
opencv-python      4.5.5.62
packaging          21.3
Pillow             9.0.1
pip                22.0.3
pyparsing          3.0.7
python-apt         1.6.5+ubuntu0.7
python-dateutil    2.8.2
python-gflags      3.1.2
PyWavelets         1.2.0
scikit-image       0.19.1
scipy              1.7.3
setuptools         60.8.1
six                1.16.0
tifffile           2021.11.2
typing_extensions  4.0.1
wheel              0.37.1
zipp               3.7.0


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow