'How to increase values of polars dataframe column by index

I have a data frame as follow

┌────────────┬──────────┬──────────┬──────────┬──────────┐
│ time       ┆ open     ┆ high     ┆ low      ┆ close    │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ i64        ┆ f64      ┆ f64      ┆ f64      ┆ f64      │
╞════════════╪══════════╪══════════╪══════════╪══════════╡
│ 1649016000 ┆ 46405.49 ┆ 47444.11 ┆ 46248.84 ┆ 46407.35 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1649030400 ┆ 46407.36 ┆ 46461.14 ┆ 45744.77 ┆ 46005.44 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1649044800 ┆ 46005.43 ┆ 46293.38 ┆ 45834.39 ┆ 46173.99 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1649059200 ┆ 46174.0  ┆ 46287.97 ┆ 45787.0  ┆ 46160.09 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ...        ┆ ...      ┆ ...      ┆ ...      ┆ ...      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1653278400 ┆ 30171.32 ┆ 30670.51 ┆ 30101.07 ┆ 30457.01 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1653292800 ┆ 30457.01 ┆ 30616.18 ┆ 30281.89 ┆ 30397.11 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1653307200 ┆ 30397.12 ┆ 30625.98 ┆ 29967.07 ┆ 30373.53 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1653321600 ┆ 30373.53 ┆ 30529.9  ┆ 30042.09 ┆ 30121.02 │
└────────────┴──────────┴──────────┴──────────┴──────────┘

I want to count how many times each price (low and high) were local minimum/maximum in a window range of 2 to 50.

first I add two columns for count of being local min/max for each row and fill by zeros

raw_data["lmin_count"] = np.zeros(len(raw_data), dtype=np.int16)
raw_data["lmax_count"] = np.zeros(len(raw_data), dtype=np.int16)

then I iterate window length from 2 to 50 and find index of each local min/max by using:

for _order in range(2, 51):
    local_minima = argrelextrema(raw_data["low"].to_numpy(), np.less, order=_order)[0]
    local_maxima = argrelextrema(raw_data["high"].to_numpy(), np.greater, order=_order)[0]

which order is window length.

and in each iteration over window length I want to increase value of lmin_count and lmax_count by indices found in local_minima and local_maxima I tried increasing value by this code:

if len(local_minima) > 1:
    raw_data[local_minima,5] += 1
if len(local_maxima) > 1:
    raw_data[local_minima,6] += 1

which local_minima and local_maxima are array of indices and 5,6 are index of lmin_count and lmax_count columns.

but got error not implemented.

So what is the best way to increase (or assign) value of column by row indices?

Update 2022/05/24

As answers were very helpful now I have other issues. I changed my code as follow:

min_expr_list = [
    (
        pl.col("price").rolling_min(
            window_size=_order * 2 + 1, min_periods=_order + 2, center=True
        )
        == pl.col("price")
    ).cast(pl.UInt32)
    for _order in range(200, 1001)
]

max_expr_list = [
    (
        pl.col("price").rolling_max(
            window_size=_order * 2 + 1, min_periods=_order + 2, center=True
        )
        == pl.col("price")
    ).cast(pl.UInt32)
    for _order in range(200, 1001)
]
raw_data = raw_data.with_columns(
    [
        pl.sum(min_expr_list).alias("min_freq"),
        pl.sum(max_expr_list).alias("max_freq"),
    ]
)

first: is it possible to merge both min_expr_list and max_expr_list into one list? and if it is possible, in with_columns expression how can I add separate columns based on each element of list?

another issue I am facing is memory usage of this approach. In previous example _order were limited but in action it is more wider than example.

currently I have datasets with millions of records (some of them have more than 10 million records) and _orders range can be from 2 to 1500 so calculating needs lots of GB of ram.

is there any better way to do that?

and one more side problem. when increasing _order to more than 1000 it seems it doesn't work. is there any limitation in source code?



Solution 1:[1]

You write very imperative code which is not really idiomatic polars. You typically should not even know where the index of a value is. Instead you assign by conditions, for instance with when(condition) -> then(value) -> otherwise(value) expressions.

Your condition in a when can still refer to an index. This snippet for instance is equal to assigning to a specific index, but then written more functional:

pl.DataFrame({
    "letters": ["a", "b", "c", "d"]
}).with_column(
    # use a condition to determine the index location
    pl.when(pl.arange(0, pl.count()) == 2)
      .then("idx_2")
      .otherwise("letters").alias("letters")
)
shape: (4, 1)
???????????
? letters ?
? ---     ?
? str     ?
???????????
? letters ?
???????????
? letters ?
???????????
? idx_2   ?
???????????
? letters ?
???????????

Your intent, counting local minima/maxima

To help you in your cause, I'd like to show how you can find your local minima in an idiomatic polars way.

The local minima/maxima can be found by:

  • taking the derivative dy/dx of a function x.
  • computing the sign of that derivative tells us where the functions slope is increasing and decreasing.
  • if we take the derivative of the sign of dy/dx we know where the signs are changing and thus where we have local minima/maxima.

Let's try this on a dummy DataFrame.

df = pl.DataFrame({
    "x": [8, 4, 2, 7, 9, 6, 3, 0]
})

# find the local minima/maxima
df = df.with_columns([
    (pl.col("x").diff().sign().diff().shift(-1) == -2).alias("local_maximum"),
    (pl.col("x").diff().sign().diff().shift(-1) == 2).alias("local_minimum")
])
print(df)
shape: (8, 3)
???????????????????????????????????????
? x   ? local_maximum ? local_minimum ?
? --- ? ---           ? ---           ?
? i64 ? bool          ? bool          ?
???????????????????????????????????????
? 8   ? false         ? false         ?
???????????????????????????????????????
? 4   ? false         ? false         ?
???????????????????????????????????????
? 2   ? false         ? true          ?
???????????????????????????????????????
? 7   ? false         ? false         ?
???????????????????????????????????????
? 9   ? true          ? false         ?
???????????????????????????????????????
? 6   ? false         ? false         ?
???????????????????????????????????????
? 3   ? false         ? false         ?
???????????????????????????????????????
? 0   ? false         ? false         ?
???????????????????????????????????????

Next we can take the cumulative sum to count the total seen local minima and maxima.

df.with_columns([
    pl.col("local_maximum").cumsum().alias("local_max_count"),
    pl.col("local_minimum").cumsum().alias("local_min_count")
])
shape: (8, 5)
???????????????????????????????????????????????????????????????????????????
? x   ? local_maximum ? local_minimum ? local_max_count ? local_min_count ?
? --- ? ---           ? ---           ? ---             ? ---             ?
? i64 ? bool          ? bool          ? u32             ? u32             ?
???????????????????????????????????????????????????????????????????????????
? 8   ? false         ? false         ? 0               ? 0               ?
???????????????????????????????????????????????????????????????????????????
? 4   ? false         ? false         ? 0               ? 0               ?
???????????????????????????????????????????????????????????????????????????
? 2   ? false         ? true          ? 0               ? 1               ?
???????????????????????????????????????????????????????????????????????????
? 7   ? false         ? false         ? 0               ? 1               ?
???????????????????????????????????????????????????????????????????????????
? 9   ? true          ? false         ? 1               ? 1               ?
???????????????????????????????????????????????????????????????????????????
? 6   ? false         ? false         ? 1               ? 1               ?
???????????????????????????????????????????????????????????????????????????
? 3   ? false         ? false         ? 1               ? 1               ?
???????????????????????????????????????????????????????????????????????????
? 0   ? false         ? false         ? 1               ? 1               ?
???????????????????????????????????????????????????????????????????????????

I hope this helps nudge you in the right direction.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1