'Idiomatic replacement of empty string '' with pl.Null (null) in polars

I have a polars DataFrame with a number of Series that look like:

pl.Series(['cow', 'cat', '', 'lobster', ''])

and I'd like them to be

pl.Series(['cow', 'cat', pl.Null, 'lobster', pl.Null])

A simple string replacement won't work since pl.Null is not of type PyString:

pl.Series(['cow', 'cat', '', 'lobster', '']).str.replace('', pl.Null)

What's the idiomatic way of doing this for a Series/DataFrame in polars?



Solution 1:[1]

Series

For a single Series, you can use the set method.

import polars as pl
my_series = pl.Series(['cow', 'cat', '', 'lobster', ''])
my_series.set(my_series.str.lengths() == 0, None)
shape: (5,)
Series: '' [str]
[
        "cow"
        "cat"
        null
        "lobster"
        null
]

DataFrame

For DataFrames, I would suggest using when/then/otherwise. For example, with this data:

df = pl.DataFrame({
    'str1': ['cow', 'dog', "", 'lobster', ''],
    'str2': ['', 'apple', "orange", '', 'kiwi'],
    'str3': ['house', '', "apartment", 'condo', ''],
})
df
shape: (5, 3)
????????????????????????????????
? str1    ? str2   ? str3      ?
? ---     ? ---    ? ---       ?
? str     ? str    ? str       ?
????????????????????????????????
? cow     ?        ? house     ?
????????????????????????????????
? dog     ? apple  ?           ?
????????????????????????????????
?         ? orange ? apartment ?
????????????????????????????????
? lobster ?        ? condo     ?
????????????????????????????????
?         ? kiwi   ?           ?
????????????????????????????????

We can run a replacement on all string columns as follows:

df.with_columns([
    pl.when(pl.col(pl.Utf8).str.lengths() ==0)
    .then(None)
    .otherwise(pl.col(pl.Utf8))
    .keep_name()
])
shape: (5, 3)
????????????????????????????????
? str1    ? str2   ? str3      ?
? ---     ? ---    ? ---       ?
? str     ? str    ? str       ?
????????????????????????????????
? cow     ? null   ? house     ?
????????????????????????????????
? dog     ? apple  ? null      ?
????????????????????????????????
? null    ? orange ? apartment ?
????????????????????????????????
? lobster ? null   ? condo     ?
????????????????????????????????
? null    ? kiwi   ? null      ?
????????????????????????????????

The above should be fairly performant.

If you only want to replace empty strings with null on certain columns, you can provide a list:

only_these = ['str1', 'str2']
df.with_columns([
    pl.when(pl.col(only_these).str.lengths() == 0)
    .then(None)
    .otherwise(pl.col(only_these))
    .keep_name()
])
shape: (5, 3)
????????????????????????????????
? str1    ? str2   ? str3      ?
? ---     ? ---    ? ---       ?
? str     ? str    ? str       ?
????????????????????????????????
? cow     ? null   ? house     ?
????????????????????????????????
? dog     ? apple  ?           ?
????????????????????????????????
? null    ? orange ? apartment ?
????????????????????????????????
? lobster ? null   ? condo     ?
????????????????????????????????
? null    ? kiwi   ?           ?
????????????????????????????????

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 cbilot