groupby result is inconsistent in polars

In the user guide, there is an example:

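import polars as pl
from datetime import date

# `dataset` is assumed to be the DataFrame loaded earlier in the user guide.

def compute_age() -> pl.Expr:
    # Age as of 2021, derived from the birth year
    return date(2021, 1, 1).year - pl.col("birthday").dt.year()

def avg_birthday(gender: str) -> pl.Expr:
    # Mean age within each group, restricted to one gender
    return compute_age().filter(
            pl.col("gender") == gender
        ).mean().alias(f"avg {gender} birthday")


q = (
    dataset.lazy()
    .groupby(["state"])
    .agg(
        [
            avg_birthday("M"),
            avg_birthday("F"),
            # Summing a boolean mask counts the matching rows
            (pl.col("gender") == "M").sum().alias("# male"),
            (pl.col("gender") == "F").sum().alias("# female"),
        ]
    )
)
df = q.collect()
df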

The result is inconsistent. For example, on the first run:

state	avg M birthday	avg F birthday	# male	# female
str	f64	f64	u32	u32
ME	58.0	67.5	4	2
AZ	60.375	59.666667	11	3
VT	78.333333	null	3	0
GU	40.0	null	1	0
KS	54.2	41.0	6	1
LA	58.0	40.0	8	1

And on the second run:

state	avg M birthday	avg F birthday	# male	# female
str	f64	f64	u32	u32
NC	56.181818	69.0	15	4
MA	60.0	56.25	11	4
CO	57.428571	49.5	9	2
IA	70.0	52.75	6	4
CA	57.323529	67.75	54	20
ME	58.0	67.5	4	2
NV	55.5	61.75	6	4

I guess it may be caused by parallelism? Is this a bug or a feature? How can I keep the result consistent?

python python-polars

Solution 1:[1]

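Use maintain_order=True on groupby. By default the order of the groups is not guaranteed, so the rows can come back in a different order on each run even though the per-group values are the same (compare the ME row in both outputs above).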

maintain_order: Make sure that the order of the groups remain consistent. This is more expensive than a default groupby. Note that this only works in expression aggregations.

(As an aside, I'm not sure where you got the squeeze=True argument in your post.)

.groupby(["state"], squeeze=True)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	cbilot
