Why are numpy functions so slow on pandas series / dataframes?

Consider a small MWE, taken from another question:

DateTime                Data
2017-11-21 18:54:31     1
2017-11-22 02:26:48     2
2017-11-22 10:19:44     3
2017-11-22 15:11:28     6
2017-11-22 23:21:58     7
2017-11-28 14:28:28    28
2017-11-28 14:36:40     0
2017-11-28 14:59:48     1

The goal is to clip all values with an upper bound of 1. My answer uses np.clip, which works fine.

np.clip(df.Data, a_min=None, a_max=1)
0    1
1    1
2    1
3    1
4    1
5    1
6    0
7    1
Name: Data, dtype: int64

Or,

np.clip(df.Data.values, a_min=None, a_max=1)
array([1, 1, 1, 1, 1, 1, 0, 1])

Both return the same values (the first as a Series, the second as a plain array). My question is about the relative performance of these two methods. Consider -

df = pd.concat([df]*1000).reset_index(drop=True)

%timeit np.clip(df.Data, a_min=None, a_max=1)
1000 loops, best of 3: 270 µs per loop

%timeit np.clip(df.Data.values, a_min=None, a_max=1)
10000 loops, best of 3: 23.4 µs per loop

Why is there such a massive difference between the two, just by calling values on the latter? In other words...

Why are numpy functions so slow on pandas objects?



Solution 1:[1]

Just read the source code; it becomes clear:

def clip(a, a_min, a_max, out=None):
    """a : array_like Array containing elements to clip."""
    return _wrapfunc(a, 'clip', a_min, a_max, out=out)

def _wrapfunc(obj, method, *args, **kwds):
    try:
        return getattr(obj, method)(*args, **kwds)
    except (AttributeError, TypeError):
        # A TypeError occurs when the object has a method of the same name
        # but with a signature that differs from numpy's. This situation has
        # occurred in the case of a downstream library like 'pandas'.
        return _wrapit(obj, method, *args, **kwds)

def _wrapit(obj, method, *args, **kwds):
    try:
        wrap = obj.__array_wrap__
    except AttributeError:
        wrap = None
    result = getattr(asarray(obj), method)(*args, **kwds)
    if wrap:
        if not isinstance(result, mu.ndarray):
            result = asarray(result)
        result = wrap(result)
    return result

Note that pandas has had its own implementation of clip since v0.13.0, so the getattr call above resolves to the pandas method rather than ndarray.clip.
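
A quick way to see this dispatch in action (a minimal sketch; the return types shown assume a reasonably recent numpy and pandas):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# np.clip resolves to getattr(s, 'clip'), i.e. the pandas method,
# so the result comes back as a Series rather than an ndarray:
print(type(np.clip(s, a_min=None, a_max=1)))   # pandas.core.series.Series

# Calling it on the underlying ndarray bypasses pandas entirely:
print(type(np.clip(s.values, None, 1)))        # numpy.ndarray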

Solution 2:[2]

There are two parts to the performance difference to be aware of here:

  • Python overhead in each library (pandas being extra helpful)
  • Difference in numeric algorithm implementation (pandas' clip actually calls np.where)

Running this on a very small array should demonstrate the difference in Python overhead. For numpy, this is understandably very small; pandas, however, does a lot of checking (null values, more flexible argument processing, etc.) before getting to the heavy number crunching. I've tried to show a rough breakdown of the stages each code path goes through before hitting C-code bedrock.

data = pd.Series(np.random.random(100))

When using np.clip on an ndarray, the overhead is simply the numpy wrapper function calling the object's method:

>>> %timeit np.clip(data.values, 0.2, 0.8)        # numpy wrapper, calls .clip() on the ndarray
>>> %timeit data.values.clip(0.2, 0.8)            # C function call

2.22 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.32 µs ± 20.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Pandas spends more time checking for edge cases before getting to the algorithm:

>>> %timeit np.clip(data, a_min=0.2, a_max=0.8)   # numpy wrapper, calls .clip() on the Series
>>> %timeit data.clip(lower=0.2, upper=0.8)       # pandas API method
>>> %timeit data._clip_with_scalar(0.2, 0.8)      # lowest level python function

102 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
90.4 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
73.7 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Relative to overall time, the overhead of both libraries before hitting C code is pretty significant. For numpy, the single wrapping instruction takes about as much time as the numeric processing itself (2.22 µs total vs 1.32 µs for the direct call, i.e. roughly 0.9 µs of wrapper overhead). Pandas has ~30x more overhead just in the first two layers of function calls (102 − 73.7 ≈ 28 µs, against numpy's ~0.9 µs).

To isolate what is happening at the algorithm level, we should check this on a larger array and benchmark the same functions:

>>> data = pd.Series(np.random.random(1000000))

>>> %timeit np.clip(data.values, 0.2, 0.8)
>>> %timeit data.values.clip(0.2, 0.8)

2.85 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.85 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit np.clip(data, a_min=0.2, a_max=0.8)
>>> %timeit data.clip(lower=0.2, upper=0.8)
>>> %timeit data._clip_with_scalar(0.2, 0.8)

12.3 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.3 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.2 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The python overhead in both cases is now negligible; the time for wrapper functions and argument checking is small relative to the calculation time on 1 million values. However, there is still a 3-4x speed difference, which can be attributed to the numeric implementation. Digging into the source code a bit, we see that the pandas implementation of clip actually uses np.where, not np.clip:

import numpy as np
import pandas as pd

def clip_where(data, lower, upper):
    ''' Actual implementation in pd.Series._clip_with_scalar (minus NaN handling). '''
    result = data.values
    result = np.where(result >= upper, upper, result)
    result = np.where(result <= lower, lower, result)
    return pd.Series(result)

def clip_clip(data, lower, upper):
    ''' What would happen if we used ndarray.clip instead. '''
    return pd.Series(data.values.clip(lower, upper))

The additional effort required to check each boolean condition separately before doing a conditional replace would seem to account for the speed difference. Specifying both upper and lower results in 4 passes through the numpy array (two inequality checks and two calls to np.where), whereas ndarray.clip can do the work in a single pass. Benchmarking these two functions shows the same 3-4x speed ratio:

>>> %timeit clip_clip(data, lower=0.2, upper=0.8)
>>> %timeit clip_where(data, lower=0.2, upper=0.8)

2.97 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.1 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm not sure why the pandas devs went with this implementation; np.clip may be a newer API function that previously required a workaround. There is also a little more to it than I've gone into here, since pandas checks for various cases before running the final algorithm, and this is only one of the implementations that may be called.

Solution 3:[3]

The performance differs because, when a pandas object is passed, numpy first looks up a method of the same name on that object via getattr rather than running its own builtin implementation.

It's not numpy operating on the pandas object that is slow; it's the pandas version of the function.

When you do

np.clip(pd.Series([1,2,3,4,5]), a_min=None, a_max=1)

_wrapfunc is called:

# Code from source 
def _wrapfunc(obj, method, *args, **kwds):
    try:
        return getattr(obj, method)(*args, **kwds)

Because of _wrapfunc's getattr lookup:

getattr(pd.Series([1,2,3,4,5]),'clip')(None, 1)
# Equivalent to `pd.Series([1,2,3,4,5]).clip(lower=None,upper=1)`
# 0    1
# 1    1
# 2    1
# 3    1
# 4    1
# dtype: int64

If you go through the pandas implementation, there is a lot of pre-checking work being done. That is why functions that take the pandas code path via numpy show such a difference in speed.

Not only clip: functions like cumsum, cumprod, reshape, searchsorted, transpose, and many more use the pandas version rather than the numpy one when you pass them a pandas object.

It might appear that numpy is doing the work on those objects, but under the hood it's the pandas function.
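
A minimal sketch to confirm this dispatch (the exact return types assume a reasonably recent numpy and pandas):

import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2])

# These numpy calls dispatch to the Series method of the same name,
# so the results come back as pandas objects:
print(type(np.cumsum(s)))          # <class 'pandas.core.series.Series'>
print(type(np.cumprod(s)))         # <class 'pandas.core.series.Series'>

# Operating on the raw ndarray stays entirely inside numpy:
print(type(np.cumsum(s.values)))   # <class 'numpy.ndarray'>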

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1
[2] Solution 2
[3] Solution 3