RAPIDS/NUMBA: Faster way to parallelize a for-loop on small data?
If I have data that easily fits into memory, but I need to iterate over it hundreds or thousands of times, is there a faster way?
For instance, suppose I have 400k data points and need to iterate over them with 1000 filters. Doing this in a for-loop is 4-10 times slower than doing a single operation on data of length 400k*1000.
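For clarity, the filter in question maps each value to 1, -1, or 0 depending on two thresholds. A minimal CPU sketch of that logic (using NumPy instead of CuPy; the function name `filter_signal_np` is hypothetical) is a single vectorized expression:

```python
import numpy as np

def filter_signal_np(x, s_low, s_high):
    """Tri-valued filter: 1 where x <= s_low, -1 where x >= s_high, else 0."""
    return np.where(x <= s_low, 1, np.where(x >= s_high, -1, 0))

x = np.array([0.1, 0.5, 0.9])
print(filter_signal_np(x, 0.2, 0.8))  # -> [ 1  0 -1]
```

CuPy exposes the same `where` API, so the identical expression runs on the GPU with `cp` in place of `np`.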
```python
# setup
import cudf
import numpy as np
import cupy as cp
from numba import cuda

cp.random.seed(42)  # cp.seed = 42 only sets an attribute; this actually seeds the RNG

signal_ranges = []
signal_len = 1000
data_size = 400000

for signal in range(signal_len):
    s_low = cp.random.rand(1, dtype='float')

    def get_high():
        return cp.random.rand(1, dtype='float')

    s_high = 0
    while s_high <= s_low:
        s_high = get_high()
    signal_ranges.append((s_low, s_high))
```
EXAMPLE 1 - length 400k*1000
```python
@cuda.jit
def filter_signal(in_col, s1, s2, out):
    i = cuda.grid(1)
    if i < in_col.size:  # boundary guard
        out[i] = 1 if in_col[i] <= s1 else -1 if in_col[i] >= s2 else 0
```

```python
%%timeit -r 1
s1 = float(signal_ranges[0][0])
s2 = float(signal_ranges[0][1])
cu_df_big = cudf.DataFrame(cp.random.rand(data_size * signal_len), columns=['in1'])
cu_df_big['0'] = 0
size = len(cu_df_big)
filter_signal.forall(size)(cu_df_big['in1'], s1, s2, cu_df_big['0'])
```
*314 ms*
EXAMPLE 2 - 400k iterated 1000 times
```python
@cuda.jit
def filter_signal(in_col, s1, s2, out):
    i = cuda.grid(1)
    if i < in_col.size:  # boundary guard
        out[i] = 1 if in_col[i] <= s1 else -1 if in_col[i] >= s2 else 0
```
```python
%%timeit -r 1
cu_df = cudf.DataFrame(cp.random.rand(data_size), columns=['in1'])
size = len(cu_df)
col_id = 0
for sigs in signal_ranges:
    s1 = float(sigs[0])
    s2 = float(sigs[1])
    col = str(col_id)
    cu_df[col] = 0
    filter_signal.forall(size)(cu_df['in1'], s1, s2, cu_df[col])
    col_id += 1
```
*2.3 s*
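One way to avoid per-filter kernel-launch overhead (a sketch, not from the original post) is to apply all thresholds in a single broadcast operation instead of a Python loop. Below is a NumPy analogue with deliberately small, hypothetical sizes; `lows` and `highs` stand in for the bounds collected in `signal_ranges`. CuPy supports the same broadcasting and `where` semantics, so the same code would run on the GPU as one launch:

```python
import numpy as np

rng = np.random.default_rng(42)
data_size, n_signals = 4000, 10          # small sizes for illustration
data = rng.random(data_size)

lows = rng.random(n_signals) * 0.5       # hypothetical filter bounds
highs = 0.5 + rng.random(n_signals) * 0.5

# Broadcast data of shape (n, 1) against bounds of shape (s,)
# to produce one (n, s) result in a single vectorized operation,
# replacing the loop over signal_ranges entirely.
x = data[:, None]
out = np.where(x <= lows, 1, np.where(x >= highs, -1, 0))
print(out.shape)  # -> (4000, 10)
```

The trade-off is memory: the broadcast result materializes the full 400k*1000 array at once, which is exactly the "single operation on data of length 400k*1000" case the question benchmarks.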
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow