Questions about Polars groupby
Q1: In polars-rust, when you do `groupby.agg`, we can use `head(10)` to get the first 10 elements in a column. But if the groups have different lengths and I need the first 20% of the elements in each group (for example, the first 24 elements of a 120-element group), how can I make that work?

Q2: With a dataframe like the sample below, my goal is to loop over the dataframe. Because Polars is column-major, I downcast the df into several ChunkedArrays and iterated via `iter().zip()`. I found this is faster than doing the same thing after `groupby(col("date"))`, which loops over list elements. Why is that? I expected the df to be shorter after the groupby, which should mean a shorter loop.
| Date | Stock | Price |
|---|---|---|
| 2010-01-01 | IBM | 1000 |
| 2010-01-02 | IBM | 1001 |
| 2010-01-03 | IBM | 1002 |
| 2010-01-01 | AAPL | 2900 |
| 2010-01-02 | AAPL | 2901 |
| 2010-01-03 | AAPL | 2902 |
Solution 1:[1]
I don't really understand your 2nd question. Maybe you can create another question with a small example.
I will answer the 1st question:
> we can use `head(10)` to get the first 10 elements in a column. But if the groups have different lengths and I need the first 20% of the elements in each group (for example, the first 24 elements of a 120-element group), how can I make that work?
We can use expressions to take a `head(n)` where `n = 0.2 * group_size`.
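import polars as pl

df = pl.DataFrame({
    "groups": ["a"] * 10 + ["b"] * 20,
    "values": range(30),
})

(df.groupby("groups")
   .agg(pl.all().head(pl.count() * 0.2))   # first 20% of each group; pl.count() is the group size
   .explode(pl.all().exclude("groups"))    # back to one row per value instead of one list per group
)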
which outputs:
shape: (6, 2)

| groups (str) | values (i64) |
|---|---|
| a | 0 |
| a | 1 |
| b | 10 |
| b | 11 |
| b | 12 |
| b | 13 |
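
To make the semantics of "first 20% of each group" concrete without depending on a particular Polars version, here is a minimal plain-Python sketch of the same computation. The helper name `head_fraction` is hypothetical, and truncating a fractional count with `int()` is an assumption; Polars may handle a non-integer `n` differently.

```python
from collections import defaultdict


def head_fraction(groups, values, fraction):
    """Return (group, value) pairs for the first `fraction` of each group's rows."""
    # Collect values per group, keeping first-seen group order and row order.
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)

    result = []
    for g, vals in by_group.items():
        # Assumption: a fractional count is truncated toward zero.
        n = int(len(vals) * fraction)
        result.extend((g, v) for v in vals[:n])
    return result


groups = ["a"] * 10 + ["b"] * 20
values = list(range(30))
rows = head_fraction(groups, values, 0.2)
print(rows)
```

For the sample data this keeps 2 of the 10 rows in group `a` and 4 of the 20 rows in group `b`, matching the six rows shown above. For a 120-row group it would keep 24 rows.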
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ritchie46 |
