The real order of sort().groupby() in polars?
The code in the user guide is as follows:
def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")

q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ]
    )
    .limit(5)
)
df = q.collect()
df
Might sort().groupby() actually execute the groupby first and then the sort, similar to pandas?
An answer by @tvashtar on this question provides some tips.
Solution 1:[1]
The logical order of a polars query is the order you read it from top to bottom.
q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ]
    )
    .limit(5)
)
This snippet has the following order of operations: sort -> groupby/agg -> limit.
Note that polars may choose to execute the query in a different order if and only if the outcome is the same. This might be done for performance reasons.
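To make the "same outcome" condition concrete, here is a minimal sketch in plain Python (not polars, and not polars's actual optimizer). The rows and the helper names `sort_by_birthday` and `keep_state` are hypothetical, made up for illustration. It shows one kind of reordering that is safe: filtering before sorting gives the same rows as sorting before filtering, so an engine is free to pick the cheaper order.

```python
# Hypothetical rows: (name, state, birthday as an ISO date string).
rows = [
    ("Cid", "CA", "1990-07-23"),
    ("Bob", "NY", "1975-03-12"),
    ("Ann", "CA", "1980-05-01"),
    ("Dee", "NY", "1985-11-30"),
]


def sort_by_birthday(data):
    # ISO date strings sort chronologically when compared as text.
    return sorted(data, key=lambda r: r[2])


def keep_state(data, state):
    # Keep only the rows for one state, preserving their relative order.
    return [r for r in data if r[1] == state]


# Plan as written: sort everything, then filter.
as_written = keep_state(sort_by_birthday(rows), "CA")

# Reordered plan: filter first (fewer rows to sort), then sort.
reordered = sort_by_birthday(keep_state(rows, "CA"))

# Both plans produce identical output, so the reordering is valid.
same_outcome = as_written == reordered
print(same_outcome)
```

The logical order is still the one you wrote; only the physical execution differs, and only because the result cannot change.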
> Might sort().groupby() actually execute the groupby first and then the sort, similar to pandas?
I don't think that pandas does this. The result would be incorrect if it did. The outcome of a first() aggregation depends on the order of the rows, so if the sort were moved after the groupby operation, the outcome of the query would change. That optimization is therefore invalid.
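The point about first() can be checked with a small sketch in plain Python that mimics the semantics (it does not use polars or pandas). The rows and the helper `first_per_state` are hypothetical. Taking the first name per state after sorting by birthday gives a different answer than taking it on the unsorted rows, and no sort applied after the aggregation can recover the difference, because the aggregated rows no longer carry the birthdays.

```python
# Hypothetical rows: (name, state, birthday as an ISO date string).
people = [
    ("Cid", "CA", "1990-07-23"),
    ("Bob", "NY", "1975-03-12"),
    ("Ann", "CA", "1980-05-01"),
    ("Dee", "NY", "1985-11-30"),
]


def first_per_state(data):
    # Mimics a groupby on state followed by a first() aggregation:
    # the first name seen for each state wins.
    firsts = {}
    for name, state, _birthday in data:
        firsts.setdefault(state, name)
    return firsts


# Sort by birthday, then aggregate (the order the query is written in).
sort_then_agg = first_per_state(sorted(people, key=lambda r: r[2]))

# Aggregate on the unsorted input, as if the sort had been deferred.
agg_on_unsorted = first_per_state(people)

print(sort_then_agg)
print(agg_on_unsorted)
print(sort_then_agg == agg_on_unsorted)
```

For "CA" the two plans disagree (the earliest birthday versus whichever row happened to come first), which is why swapping sort and groupby here would not be a valid optimization.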
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ritchie46 |
