'Pandas pipeline with conditions
I have a Pandas pipeline and would like to use either count
or mean
function based on a boolean variable.
I came out with the following solution:
import pandas as pd # version 1.0.4
my_boolean = False
df = pd.read_csv(
'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
pipeline = (df.groupby('species')['sepal_length']
# .count() or .mean() based on value of my_boolean variable
.pipe(pd.core.groupby.GroupBy.mean if my_boolean
else pd.core.groupby.GroupBy.size)
)
I had to change pd.core.groupby.GroupBy.count
by pd.core.groupby.GroupBy.size
as I encoutered a NotImplementedError
for count
function.
Is there a better way to handle conditions within pipelines?
Thanks
Solution 1:[1]
Change to lambda
and it seems to work fine:
my_boolean = False
df = pd.read_csv(
'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
pipeline = (df.groupby('species')['sepal_length']
# .count() or .mean() based on value of my_boolean variable
.pipe(lambda g: g.mean() if my_boolean
else g.count())
)
Solution 2:[2]
You can also monkey patch a specific function that will execute the pipe
only if a condition is met.
import time
import pandas as pd
from pandas.core import common as com
def pipeif(self, func: 'Callable[..., T] | tuple[Callable[..., T], str]',
*args,
cond:'bool'=True,
**kwargs,
):
if cond:
return com.pipe(self, func, *args, **kwargs)
else:
return self
pd.DataFrame.pipeif = pipeif
A small example with a function that print current time and is executed only twice (as the second pipeif has a False
cond argument).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Quang Hoang |
Solution 2 | Adrien Pacifico |