'Replace column value in pandas for all possible combinations of conditions
I have a dataframe (df) that I want to manipulate as follows
date,string
2002-01-01,ABAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,BBAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,BABA
2002-01-02,ABBB
2002-01-02,DCDC
2002-01-02,AABB
- Check for all cases where the string contains 'D' and 'C' in whatever order and change to 'DDDD'
- Also check for all occurrences where the string contains 'A' and 'B' and change to 'AAAA'
As a simple example, the below code will produce the expected output below.
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.read_csv('df.csv')
df["string"].mask(df["string"] == 'DCDC', 'DDDD', inplace=True)
df["string"].mask(df["string"] == 'ABAA', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'BBAA', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'BABA', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'ABBB', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'AABB', 'AAAA', inplace=True)
print(df)
Expected Output:
date,string
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,AAAA
2002-01-02,AAAA
2002-01-02,DDDD
2002-01-02,AAAA
However, the above is too hard coded. I am thinking, I will first something like below to first extract all instances that need to be replaced:
letters_dc = ['D','C']
letters_ab = ['A','B']
contains_dc = [df['symbol'].str.contains(i) for i in letters_dc]
contains_ab = [df['symbol'].str.contains(i) for i in letters_ab]
resul = df[np.all(contains_dc, axis=0) | np.all(contains_ab, axis=0)]
How can I proceed from here or is there a better way to approach this.
Solution 1:[1]
You can use numpy.logical_and.reduce:
import numpy as np
letters = [['D','C'], ['A','B']]
for l in letters:
df['string'] = (df['string']
.mask(np.logical_and.reduce(
[df['string'].str.contains(x)
for x in l]), l[0]*4)
)
output:
date string
0 2002-01-01 AAAA
1 2002-01-01 AAAA
2 2002-01-01 CCCC
3 2002-01-01 AAAA
4 2002-01-01 AAAA
5 2002-01-02 BBBB
6 2002-01-02 AAAA
7 2002-01-02 AAAA
8 2002-01-02 DDDD
9 2002-01-02 AAAA
Solution 2:[2]
You could use this:
ab = df['string'].str.match(r'^[AB]+$')
cd = df['string'].str.match(r'^[CD]+$')
newdf = df.assign(string=df['string'].where(~ab, 'AAAA').where(~cd, 'DDDD'))
>>> newdf
date string
0 2002-01-01 AAAA
1 2002-01-01 AAAA
2 2002-01-01 DDDD
3 2002-01-01 AAAA
4 2002-01-01 AAAA
5 2002-01-02 AAAA
6 2002-01-02 AAAA
7 2002-01-02 AAAA
8 2002-01-02 DDDD
9 2002-01-02 AAAA
Any string that matches (fully) any combination of 'C' and 'D' is replaced by 'DDDD'. Similarly, any 'AB' combination becomes 'AAAA'. All other values are left unmodified.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mozway |
| Solution 2 | Pierre D |
