How to separate characters of a column based on its intersection with another column?
There are two columns in my df. The second column contains the value of the first column plus extra characters (letters and/or digits):
import pandas as pd

values = {
    'number': [2830, 8457, 9234],
    'nums': ['2830S', '8457M', '923442']
}
df = pd.DataFrame(values, columns=['number', 'nums'])
The extra characters are always after the common characters! How can I separate the characters that are not common between the two columns? I am looking for a simple solution, not a loop to check every character.
Solution 1:[1]
Replace the common characters with an empty string:
f_diff = lambda x: x['nums'].replace(x['number'], '')
df['extra'] = df[['number', 'nums']].astype(str).apply(f_diff, axis=1)
print(df)
# Output
number nums extra
0 2830 2830S S
1 8457 8457M M
2 9234 923442 42
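One caveat worth checking before relying on this: str.replace removes every occurrence of the substring, not only the leading one. The sketch below uses a hypothetical row (number 12, nums '12A12', not part of the question's data) to show the difference, and how passing a count of 1 restricts the replacement to the first occurrence.

```python
import pandas as pd

# Hypothetical row where the number reappears later in the string.
demo = pd.DataFrame({'number': [12], 'nums': ['12A12']})
as_str = demo[['number', 'nums']].astype(str)

# Plain replace removes every occurrence of the number.
replace_all = as_str.apply(lambda x: x['nums'].replace(x['number'], ''), axis=1)

# A count of 1 limits the replacement to the first (leftmost) occurrence.
replace_first = as_str.apply(lambda x: x['nums'].replace(x['number'], '', 1), axis=1)

print(replace_all.tolist())    # ['A']
print(replace_first.tolist())  # ['A12']
```

With the data from the question both variants give the same result, since each number appears only once in its row.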
Update
If the number values are always the leading characters of the nums column, you can use a simpler function that slices them off by length:
f_diff2 = lambda x: x['nums'][len(x['number']):]
df['extra'] = df[['number', 'nums']].astype(str).apply(f_diff2, axis=1)
print(df)
# Output
number nums extra
0 2830 2830S S
1 8457 8457M M
2 9234 923442 42
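The same slicing idea can also be written without apply. This is only a sketch of an alternative, assuming the same df as in the question; it pairs the two columns with zip and slices each string by the number of digits in its number.

```python
import pandas as pd

df = pd.DataFrame({'number': [2830, 8457, 9234],
                   'nums': ['2830S', '8457M', '923442']})

# Drop as many leading characters as the number has digits.
df['extra'] = [s[len(str(n)):] for n, s in zip(df['number'], df['nums'])]

print(df['extra'].tolist())  # ['S', 'M', '42']
```

Like f_diff2, this assumes every nums value really starts with its number; it does not verify that.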
Solution 2:[2]
I would delete the prefix of the string. For this you can use the apply() method to apply the following function to each row:
def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    return text
df['nums'] = df.apply(lambda x: remove_prefix(x['nums'], str(x['number'])), axis=1)
df
Output:
number nums
0 2830 S
1 8457 M
2 9234 42
If you have Python >= 3.9, you only need this (the number must be converted to str first, because removeprefix expects a string):
df['nums'] = df.apply(lambda x: x['nums'].removeprefix(str(x['number'])), axis=1)
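To illustrate why the prefix check can matter, here is a small sketch comparing it with plain slicing on a hypothetical row whose nums value does not start with the number ('X9234' is made up for the example). It reuses the remove_prefix helper from this solution.

```python
def remove_prefix(text, prefix):
    # Strip the prefix only when the text actually starts with it.
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

number, nums = 9234, 'X9234'  # hypothetical mismatching row

sliced = nums[len(str(number)):]            # blindly drops the first 4 characters
checked = remove_prefix(nums, str(number))  # leaves the string unchanged

print(sliced)   # 4
print(checked)  # X9234
```

Slicing silently returns a wrong remainder for such a row, while the prefix check returns the original string, which makes the mismatch easier to spot.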
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Corralien |
| Solution 2 | JANO |
