Groupby result is inconsistent in polars
In the user guide, there is an example:
import polars as pl
from datetime import date

def compute_age() -> pl.Expr:
    # Age as of 2021, derived from the "birthday" column
    return date(2021, 1, 1).year - pl.col("birthday").dt.year()

def avg_birthday(gender: str) -> pl.Expr:
    # Mean age within each group, restricted to the given gender
    return compute_age().filter(
        pl.col("gender") == gender
    ).mean().alias(f"avg {gender} birthday")

q = (
    dataset.lazy()
    .groupby(["state"])
    .agg(
        [
            avg_birthday("M"),
            avg_birthday("F"),
            (pl.col("gender") == "M").count().alias("# male"),
            (pl.col("gender") == "F").sum().alias("# female"),
        ]
    )
)
df = q.collect()
df
The result is inconsistent: the rows come back in a different order on each run. For example, the first run gives:
| state | avg M birthday | avg F birthday | # male | # female |
|---|---|---|---|---|
| str | f64 | f64 | u32 | u32 |
| ME | 58.0 | 67.5 | 4 | 2 |
| AZ | 60.375 | 59.666667 | 11 | 3 |
| VT | 78.333333 | null | 3 | 0 |
| GU | 40.0 | null | 1 | 0 |
| KS | 54.2 | 41.0 | 6 | 1 |
| LA | 58.0 | 40.0 | 8 | 1 |
The second run gives:
| state | avg M birthday | avg F birthday | # male | # female |
|---|---|---|---|---|
| str | f64 | f64 | u32 | u32 |
| NC | 56.181818 | 69.0 | 15 | 4 |
| MA | 60.0 | 56.25 | 11 | 4 |
| CO | 57.428571 | 49.5 | 9 | 2 |
| IA | 70.0 | 52.75 | 6 | 4 |
| CA | 57.323529 | 67.75 | 54 | 20 |
| ME | 58.0 | 67.5 | 4 | 2 |
| NV | 55.5 | 61.75 | 6 | 4 |
I guess it may be caused by parallelism. Is this a bug or a feature? How can I keep the result consistent?
Solution 1:[1]
Use maintain_order=True on groupby.
maintain_order: Make sure that the order of the groups remain consistent. This is more expensive than a default groupby. Note that this only works in expression aggregations.
(As an aside, I'm not sure where you got the squeeze=True argument in your post.)
.groupby(["state"], squeeze=True)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | cbilot |
