Pandas pipeline with conditions

I have a Pandas pipeline and would like to apply either the count or the mean function, depending on a boolean variable.

I came up with the following solution:

import pandas as pd # version 1.0.4

my_boolean = False
df = pd.read_csv(
    'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

pipeline = (df.groupby('species')['sepal_length']
            # .count() or .mean() based on value of my_boolean variable
            .pipe(pd.core.groupby.GroupBy.mean if my_boolean
                  else pd.core.groupby.GroupBy.size)
            )

I had to replace pd.core.groupby.GroupBy.count with pd.core.groupby.GroupBy.size because I encountered a NotImplementedError with the count function.
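
One caveat about that swap: size and count are not always interchangeable. size counts every row in a group, while count only counts non-missing values, so the two agree only when the column has no NaN. A minimal sketch with a made-up frame (the data below is invented for illustration, not taken from the iris file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group "a".
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "sepal_length": [5.1, np.nan, 6.3, 5.8],
})

grouped = df.groupby("species")["sepal_length"]

sizes = grouped.size()    # number of rows per group, NaN included
counts = grouped.count()  # number of non-missing values per group

print(sizes)
print(counts)
```

If missing values can occur, the choice between the two changes the result.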

Is there a better way to handle conditions within pipelines?

Thanks



Solution 1:[1]

Use a lambda instead; it seems to work fine:

import pandas as pd

my_boolean = False
df = pd.read_csv(
    'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

pipeline = (df.groupby('species')['sepal_length']
            # .count() or .mean() based on value of my_boolean variable
            .pipe(lambda g: g.mean() if my_boolean
                  else g.count())
            )
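
A possible variation on the same idea, in case the lambda feels heavy: since both branches are plain aggregations, the aggregation name can be chosen first and passed to agg as a string. This is only a sketch; the helper name summarize and the small in-memory frame are made up for illustration and stand in for the iris data.

```python
import pandas as pd

# Hypothetical stand-in for the iris data so the snippet runs offline.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "sepal_length": [5.0, 5.2, 6.3],
})

def summarize(frame, use_mean):
    # Pick the aggregation by name, then let agg dispatch to it.
    how = "mean" if use_mean else "count"
    return frame.groupby("species")["sepal_length"].agg(how)

means = summarize(df, use_mean=True)
counts = summarize(df, use_mean=False)

print(means)
print(counts)
```

This keeps the condition outside the chain, which may be easier to read when there are more than two options.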

Solution 2:[2]

You can also monkey-patch a helper method onto DataFrame that runs the piped function only if a condition is met.

import time

import pandas as pd
from pandas.core import common as com

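def pipeif(self,
           func: 'Callable[..., T] | tuple[Callable[..., T], str]',
           *args,
           cond: bool = True,
           **kwargs):
    # Apply func through the regular pipe machinery only when cond is True.
    if cond:
        return com.pipe(self, func, *args, **kwargs)
    # Otherwise pass the object through unchanged so the chain continues.
    return self

# Attach the helper to DataFrame so it is available in method chains.
pd.DataFrame.pipeif = pipeif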

A small example with a function that prints the current time and is executed only twice (because the second pipeif is given cond=False).
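
The example described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original screenshot: the function name print_time, the calls list, and the toy frame are invented for illustration, and pipeif is redefined here with the public self.pipe (instead of com.pipe) so the snippet is self-contained.

```python
import time

import pandas as pd

def pipeif(self, func, *args, cond=True, **kwargs):
    # Same idea as above, but delegating to the public DataFrame.pipe.
    if cond:
        return self.pipe(func, *args, **kwargs)
    return self

pd.DataFrame.pipeif = pipeif

calls = []  # records each execution so the count can be inspected

def print_time(df):
    # Print the current time and hand the frame back unchanged.
    now = time.strftime("%H:%M:%S")
    print(now)
    calls.append(now)
    return df

df = pd.DataFrame({"a": [1, 2, 3]})

result = (df
          .pipeif(print_time)              # runs
          .pipeif(print_time, cond=False)  # skipped
          .pipeif(print_time)              # runs
          )

print(len(calls))
```

Only the first and third steps call print_time; the middle step returns the frame untouched.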


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  Quang Hoang
Solution 2  Adrien Pacifico