'What's the difference of creating new column between df.with_column and df['new_col'] = func(df['old_col']) in Polars

Sorry if this question's asked before, but I've never seen it yet.

I realised Polars can create a new column with

def func(series):
   """sample function """
   return series

df.with_column(
   pl.lit(func(df['old_col'])).alias('new_col')
)

yet, it also can be written as:

df['new_col'] = func(df['old_col'])

Is this performance related?

Cheers!

python python-polars

Solution 1:^[1]

Forgive my long answer in advance. But I want to show the benefits and reasons why coding "the Polars way" is worth the learning curve - to achieve the shockingly fast performance that we Polars users have come to expect.

Transliterated Pandas-style coding

The following code, for lack of a better term, is "transliterated Pandas" code.

df['new_col'] = func(df['old_col'])

That is, it resembles Pandas-style code. And it certainly runs in Polars. It may even yield acceptable performance in some situations.

But it doesn't embrace the Polars way of expressing your calculations so that the Polars engine can easily parallelize computations across multiple cores. Nor does this code allow for query optimization in Lazy mode. (Indeed, the use of a Python function likely will subject the code to the Python GIL, forcing code to run in single-threaded fashion.)

Most who use Polars have come to expect much better performance than single-threaded Python bytecode performance. (Indeed, that's the entire point of Polars.)

An Example

Let's take a closer look. Let's use the following dataset. Set nbr_obs high enough to challenge your system.

import polars as pl

nbr_obs = 1_000_000
df = pl.DataFrame(
    {
        'date1': pl.repeat('2001-01-01', nbr_obs, eager=True),
        'date2': pl.repeat('2002-01-01', nbr_obs, eager=True),
        'date3': pl.repeat('2003-01-01', nbr_obs, eager=True),
        'date4': pl.repeat('2004-01-01', nbr_obs, eager=True),
        'date5': pl.repeat('2005-01-01', nbr_obs, eager=True),
    }
)

shape: (1000000, 5)
??????????????????????????????????????????????????????????????????
? date1      ? date2      ? date3      ? date4      ? date5      ?
? ---        ? ---        ? ---        ? ---        ? ---        ?
? str        ? str        ? str        ? str        ? str        ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? ...        ? ...        ? ...        ? ...        ? ...        ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????

Transliterated Pandas-style code

Now let's create a Python function to parse these dates.

from datetime import datetime
def parse_it(series):
    return [datetime.strptime(date_str, "%Y-%m-%d") for date_str in series]

And let's run this using "transliterated Pandas" code. Depending on your OS, watch the CPU usage on your machine as this code runs.

df['date1_parse'] = parse_it(df['date1'])
df['date2_parse'] = parse_it(df['date2'])
df['date3_parse'] = parse_it(df['date3'])
df['date4_parse'] = parse_it(df['date4'])
df['date5_parse'] = parse_it(df['date5'])

What do you observe? You should observe painfully single-threaded performance. (Polars users avoid this at all costs, especially since many of us are using large datasets and multi-core computing platforms.)

The loss of Lazy Mode

Now, let's try to switch to Polars Lazy mode so that we can harness Polars' powerful query optimization engine.

df = df.lazy()
df['date1_parse'] = parse_it(df['date1'])

>>> df['date1_parse'] = parse_it(df['date1'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'LazyFrame' object is not subscriptable

Oops, we cannot use transliterated Pandas-style code in Lazy mode. That's a big loss of potential performance, in general.

The Polars Way

The way to harness Polars' powerful engine is to master the Expression syntax. We code our goals in terms of Expressions. If done well, Polars can parallelize our code (and avoid the Python GIL). The result is shockingly better performance.

Now, let's recreate our dataset, and parse dates using Polars Expressions. Notice that I've abandoned using a Python function altogether, and have stayed "inside the Polars API". That is, I've expressed everything I need using Expressions.

df = pl.DataFrame(
    {
        'date1': pl.repeat('2001-01-01', nbr_obs, eager=True),
        'date2': pl.repeat('2002-01-01', nbr_obs, eager=True),
        'date3': pl.repeat('2003-01-01', nbr_obs, eager=True),
        'date4': pl.repeat('2004-01-01', nbr_obs, eager=True),
        'date5': pl.repeat('2005-01-01', nbr_obs, eager=True),
    }
)

df.with_columns([
    pl.col("^date.*$").str.strptime(pl.Date, "%Y-%m-%d")
])

shape: (1000000, 5)
??????????????????????????????????????????????????????????????????
? date1      ? date2      ? date3      ? date4      ? date5      ?
? ---        ? ---        ? ---        ? ---        ? ---        ?
? date       ? date       ? date       ? date       ? date       ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? ...        ? ...        ? ...        ? ...        ? ...        ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????
? 2001-01-01 ? 2002-01-01 ? 2003-01-01 ? 2004-01-01 ? 2005-01-01 ?
??????????????????????????????????????????????????????????????????

This time, what did you notice about the performance? Massively parallel. And shocking performance.

Does this (long-winded) answer provide some insight as why it's best to master the Expressions syntax, and avoid "transliterated Pandas-style" code?

You can find more information in the Polars Cookbook. There are lots of great examples of using Expressions and coding "the Polars way".

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1