'Efficiency of multiple chained str transformation and alternatives
I'm wanting to change a dataframe column so the values are lower case and also have their whitespace stripped.
For this I used chained str transformations.
df.loc[:, column] = df.loc[:, column].str.lower().str.strip()
The above snippet works, but it looks quite messy as I have to use .str. twice - is there a better/more efficient solution?
Solution 1:[1]
You can use a list comprehension:
df['col2'] = [x.lower().strip() for x in df['col']]
Doing this can be faster than chaining multiple str:
%%timeit
df['col2'] = df['col'].str.strip().str.lower()
# 344 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['col2'] = [x.lower().strip() for x in s]
# 182 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
used input (1M rows):
df = pd.DataFrame({'col': [' aBc DeF ']*1000000})
NB. I used strip before lower in the comparison as this is faster than lower, then strip.
non string values in the input
If there are non string values, the list comprehension will fail.
There are several possibilities to handle this.
Using a systematic check:
df['col2'] = [x if pd.isna(x) else x.lower().strip() for x in df['col']]
this will however add an extra burden and will probably be slower compared to only two chained str (but might still be faster than more than 2 chained str).
Using a custom function and try/except:
try/except has a very low cost if no error is triggered, which might make it a viable solution when the number of non-string values is low. In this case one would use:
def f(x):
try:
return x.lower().strip()
except: # add exhaustive expected exceptions if known
return x
df['col2'] = [f(x) for x in df['col']]
timings (1M rows with 1% non-string values):
df = pd.DataFrame({'col': np.random.choice([' aBc DeF ', float('nan')],
p=[0.99, 0.01], size=1000000)})
%%timeit
df['col2'] = df['col'].str.strip().str.lower()
349 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['col2'] = [x if pd.isna(x) else x.lower().strip() for x in df['col']]
442 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['col2'] = [f(x) for x in df['col']]
235 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:[2]
df = pd.DataFrame(['ABCD EFG', 'ABC DE'], columns=['col1'])
df['col1'] = df.apply(lambda row: row['col1'].lower().strip(), axis=1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | yllwpr |
