'Find a pattern in middle of multiple sentences
I have a dataframe as below,
data = [
[ 1, 'AR-123456' ],
[ 1, '123456' ],
[ 2, '345678' ],
[ 3,'Application-12345678901'],
[ 3, '12345678901']
]
df = pd.DataFrame(data, columns=['Case', 'ID'] )
| Case | ID |
|---|---|
| 1 | AR-123456 |
| 1 | 123456 |
| 2 | 345678 |
| 3 | Application-12345678901 |
| 3 | 12345678901 |
So basically I am trying to remove rows where for the same Case, the IDs are digits from AR- or Application- i.e., the final expected output :-
| Case | ID |
|---|---|
| 1 | AR-123456 |
| 2 | 345678 |
| 3 | Application-12345678901 |
Solution 1:[1]
Extract the digits and drop_duplicates:
df["digits"] = df["ID"].str.extract("(\d+)")
output = df.drop_duplicates(["Case","digits"]).drop("digits",axis=1)
>>> output
Case ID
0 1 AR-123456
2 2 345678
3 3 Application-12345678901
Solution 2:[2]
You can groupby Case column and then drop the duplicated ID rows that is in AR/Application item
out = (df.groupby('Case')
.apply(lambda g: g[~g['ID'].isin(g['ID'].str.extract('(AR|Application)-(\d+)')[1])])
.reset_index(drop=True))
print(out)
Case ID
0 1 AR-123456
1 2 345678
2 3 Application-12345678901
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | not_speshal |
| Solution 2 | Ynjxsjmh |
