groupby result is inconsistent in polars

In the user guide, there is an example:

from datetime import date

import polars as pl

def compute_age() -> pl.Expr:
    return date(2021, 1, 1).year - pl.col("birthday").dt.year()

def avg_birthday(gender: str) -> pl.Expr:
    return compute_age().filter(
            pl.col("gender") == gender
        ).mean().alias(f"avg {gender} birthday")


q = (
    datasetn.lazy()
    .groupby(["state"])
    .agg(
        [
            avg_birthday("M"), 
            avg_birthday("F"),
            (pl.col("gender") == "M").sum().alias("# male"),
            (pl.col("gender") == "F").sum().alias("# female"),
        ]
    )
)
df = q.collect()
df

The result is inconsistent between runs. For example, on the first run:

state  avg M birthday  avg F birthday  # male  # female
str    f64             f64             u32     u32
ME     58.0            67.5            4       2
AZ     60.375          59.666667       11      3
VT     78.333333       null            3       0
GU     40.0            null            1       0
KS     54.2            41.0            6       1
LA     58.0            40.0            8       1

And on the second run:

state  avg M birthday  avg F birthday  # male  # female
str    f64             f64             u32     u32
NC     56.181818       69.0            15      4
MA     60.0            56.25           11      4
CO     57.428571       49.5            9       2
IA     70.0            52.75           6       4
CA     57.323529       67.75           54      20
ME     58.0            67.5            4       2
NV     55.5            61.75           6       4

I guess it may be caused by parallelism? Is this a bug or a feature? How can I keep the result consistent?



Solution 1:[1]

Use maintain_order=True on groupby.

maintain_order: Make sure that the order of the groups remain consistent. This is more expensive than a default groupby. Note that this only works in expression aggregations.

(As an aside, I'm not sure where you got the squeeze=True argument in your post.)

.groupby(["state"], squeeze=True)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 cbilot