RAPIDS/NUMBA: Faster way to parallelize a for-loop on small data?

If I have data that easily fits into memory, but I need to iterate over it hundreds or thousands of times, is there a faster way?

For instance, suppose I have 400k datapoints and I need to iterate over them with 1000 filters. A for-loop over the filters is 4-10 times slower than a single operation on data of length 400k*1000.

# setup
import cudf
import numpy as np
import cupy as cp
from numba import cuda

cp.random.seed(42)  # cp.seed = 42 only sets an attribute; seed the RNG explicitly

signal_ranges = []
signal_len = 1000
data_size = 400000
for signal in range(signal_len):
    s_low = cp.random.rand(1, dtype='float64')
    s_high = cp.random.rand(1, dtype='float64')
    while s_high <= s_low:  # resample until the upper bound exceeds the lower
        s_high = cp.random.rand(1, dtype='float64')
    signal_ranges.append((s_low, s_high))
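(As an aside, the resampling loop above issues one tiny GPU RNG call per scalar. The same ranges can be drawn in a single vectorized call; a minimal sketch of the idea, shown here with NumPy — CuPy mirrors this part of NumPy's random API, so `np` can be swapped for `cp`:)

```python
import numpy as np

np.random.seed(42)
signal_len = 1000

# Draw both bounds for every signal at once; per row, the smaller
# value becomes the low bound and the larger the high bound, so no
# resampling loop is needed (exact ties have probability ~0 for floats).
draws = np.random.rand(signal_len, 2)
lows = draws.min(axis=1)
highs = draws.max(axis=1)
signal_ranges = list(zip(lows, highs))
```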

EXAMPLE 1 - length 400k*1000

@cuda.jit
def filter_signal(in_col, s1, s2, out):
    i = cuda.grid(1)
    if i < in_col.size: # boundary guard
        out[i] = 1 if in_col[i] <= s1 else -1 if in_col[i] >= s2 else 0

%%timeit -r 1 
s1 = float(signal_ranges[0][0])
s2 = float(signal_ranges[0][1])
cu_df_big = cudf.DataFrame(cp.random.rand(data_size * signal_len), columns=['in1'])
cu_df_big['0'] = 0
size = len(cu_df_big)

filter_signal.forall(size)(cu_df_big['in1'], s1, s2, cu_df_big['0'])

*314ms*


EXAMPLE 2 - 400k iterated 1000 times

@cuda.jit
def filter_signal(in_col, s1, s2, out):
    i = cuda.grid(1)
    if i < in_col.size: # boundary guard
        out[i] = 1 if in_col[i] <= s1 else -1 if in_col[i] >= s2 else 0

%%timeit -r 1 
cu_df = cudf.DataFrame(cp.random.rand(data_size), columns=['in1'])
size = len(cu_df)
for col_id, (s_low, s_high) in enumerate(signal_ranges):
    s1 = float(s_low)
    s2 = float(s_high)
    col = str(col_id)
    cu_df[col] = 0
    filter_signal.forall(size)(cu_df['in1'], s1, s2, cu_df[col])

*2.3 secs*
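(For what it's worth, the timings above point in one possible direction: batch the work so the GPU sees a single large launch rather than 1000 small ones. One way to do that without a custom kernel is a single broadcasted comparison over all filters at once. Below is a hedged sketch of that idea, illustrated with NumPy at a reduced size; CuPy follows NumPy's broadcasting semantics, so swapping `np` for `cp` runs it on the GPU, memory permitting — the full 400k x 1000 int8 output would be roughly 400 MB:)

```python
import numpy as np

np.random.seed(42)
data_size, signal_len = 4000, 100  # reduced sizes for illustration

data = np.random.rand(data_size)
lows = np.random.rand(signal_len)
highs = lows + (1.0 - lows) * np.random.rand(signal_len)  # highs >= lows

# Broadcast (data_size, 1) against (signal_len,) -> (data_size, signal_len):
# every filter is applied in one fused pass instead of a Python-level loop.
d = data[:, None]
out = np.where(d <= lows, 1, np.where(d >= highs, -1, 0)).astype(np.int8)
```

Each column of `out` corresponds to one `(low, high)` filter, matching the per-column output columns of the looped version.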


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
