Python Pandas, apply function

I am trying to use apply to avoid an iterrows() iterator in a function.

However, that pandas method is poorly documented, and I can't find an example of how to use it, except for the lame .apply(np.sqrt) in the documentation. There is no example of how to pass arguments, etc.

Anyway, here is a toy example of what I am trying to do.

In my understanding, apply will actually do the same as iterrows(), i.e., iterate (over the rows if axis=0). On each iteration, the input x of the function should be the row iterated over. However, the error messages I keep receiving sort of disprove that assumption...

grid = np.random.rand(5,2)
df = pd.DataFrame(grid)

def multiply(x):
    x[3]=x[0]*x[1]

df = df.apply(multiply, axis=0)

The example above returns an empty df. Can anyone shed some light on my misunderstanding?



Solution 1:[1]

import pandas as pd
import numpy as np

grid = np.random.rand(5,2)
df = pd.DataFrame(grid)

def multiply(x):
    return x[0]*x[1]

df['multiply'] = df.apply(multiply, axis = 1)
print(df)

Results in:

          0         1  multiply
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162

Explanation:

The function you are applying needs to return a value. You also want to apply it to each row, not each column: axis=0 passes each column to the function, whereas axis=1 passes each row, so the axis parameter you passed was incorrect.

Finally, notice that I assign the result to the 'multiply' column outside of the function. You can easily change this to df[3] = ..., as in your code, and get a DataFrame like this:

          0         1         3
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162
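To make the axis point concrete, here is a minimal sketch on a small fixed frame (the values are made up for illustration and are not from the question). It shows what the function receives under each axis setting:

```python
import pandas as pd

# Small fixed frame so the results are easy to check by hand.
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# axis=0: the function is called once per column and receives that column as a Series.
per_column = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function is called once per row and receives that row as a Series.
per_row = df.apply(lambda row: row[0] * row[1], axis=1)

print(per_column.tolist())  # [9.0, 12.0]
print(per_row.tolist())     # [2.0, 12.0, 30.0]
```

With axis=0 the result has one entry per column; with axis=1 it has one entry per row, which is the shape needed for a new column.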

Solution 2:[2]

Note that you can use lambda functions as well. See the pandas documentation for DataFrame.apply.

For your example, you can run:

df['multiply'] = df.apply(lambda row: row[0] * row[1], axis = 1)

which produces the same output as @Andy's answer.

This can be useful if your function takes separate arguments, for example:

def multiply(a,b):
    return a*b

df['multiply'] = df.apply(lambda row: multiply(row[0] ,row[1]), axis = 1)

More examples are in the Enhancing Performance section of the pandas documentation.
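Since the question also asks how to pass arguments, here is a small sketch of the other route: DataFrame.apply accepts an args tuple and forwards extra keyword arguments to the function. The function scaled_product and its parameters factor and offset are hypothetical names used only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])

def scaled_product(row, factor, offset=0.0):
    # row is one row of df (a Series); factor and offset are illustrative extras.
    return row[0] * row[1] * factor + offset

# Positional extras go in args=(...); extra keyword arguments are forwarded to the function.
df['scaled'] = df.apply(scaled_product, axis=1, args=(10,), offset=1.0)

print(df['scaled'].tolist())  # [21.0, 121.0]
```

This avoids wrapping the call in a lambda when the function already takes the row as its first parameter.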

Solution 3:[3]

When applying a function, that function needs to return the result of the operation over each column or row. You are getting None because multiply doesn't return anything. That is, the function passed to apply should return a value computed from the row or column it receives, not do the assignment itself.

You're also iterating over the wrong axis here. Your current code takes the first and second elements of each column and multiplies them together.

A correct multiply function:

def multiply(x):
    return x[0]*x[1]

df[3] = df.apply(multiply, axis='columns')

With that being said, you can do much better than apply here, as it is not a vectorized operation. Just multiply the columns together directly.

df[3] = df[0]*df[1]

In general, you should avoid apply when possible, as it is not much more than a loop under the hood.
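As a quick sanity check, the sketch below compares the row-wise apply with the direct column product on random data. The seed and the use of numpy's default_rng are my own choices for reproducibility, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Seeded generator so the frame is the same on every run.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 2)))

# Row-by-row product via apply (a Python-level loop over rows).
via_apply = df.apply(lambda row: row[0] * row[1], axis=1)

# Same product computed on whole columns at once.
via_columns = df[0] * df[1]

print(np.allclose(via_apply, via_columns))  # True
```

Both approaches give the same values, so switching to the column expression changes only the speed, not the result.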

Solution 4:[4]

One of the rules of Pandas Zen says: always try to find a vectorized solution first.

.apply(..., axis=1) is not vectorized!

Consider alternatives:

In [164]: df.prod(axis=1)
Out[164]:
0    0.770675
1    0.539782
2    0.318027
3    0.597172
4    0.211643
dtype: float64

In [165]: df[0] * df[1]
Out[165]:
0    0.770675
1    0.539782
2    0.318027
3    0.597172
4    0.211643
dtype: float64

Timing against a 50,000-row DataFrame:

In [166]: df = pd.concat([df] * 10**4, ignore_index=True)

In [167]: df.shape
Out[167]: (50000, 2)

In [168]: %timeit df.apply(multiply, axis=1)
1 loop, best of 3: 6.12 s per loop

In [169]: %timeit df.prod(axis=1)
100 loops, best of 3: 6.23 ms per loop

In [170]: def multiply_vect(x1, x2):
     ...:     return x1*x2
     ...:

In [171]: %timeit multiply_vect(df[0], df[1])
1000 loops, best of 3: 604 µs per loop

Conclusion: use .apply() as a very last resort (i.e. when nothing else helps)
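The %timeit lines above need IPython. For a plain script, a rough equivalent can be sketched with the standard timeit module. The frame size (5,000 rows rather than 50,000) and the repeat count are arbitrary choices to keep the run short, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5_000, 2)))  # smaller than 50,000 rows to keep the run quick

def multiply(row):
    # Called once per row by apply(axis=1).
    return row[0] * row[1]

# Total wall-clock time for 3 runs of each approach.
t_apply = timeit.timeit(lambda: df.apply(multiply, axis=1), number=3)
t_prod = timeit.timeit(lambda: df.prod(axis=1), number=3)
t_cols = timeit.timeit(lambda: df[0] * df[1], number=3)

print(f"apply: {t_apply:.4f}s  prod: {t_prod:.4f}s  columns: {t_cols:.4f}s")
```

On typical setups the apply figure should come out far larger than the other two, in line with the measurements above.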

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andy
Solution 2 vinzee
Solution 3 miradulo
Solution 4 MaxU - stop genocide of UA