Is there a way to identify duplicated rows based on partial text/string?
I am trying to flag duplicated items based on a partial text/string match. For example, in the dataset below, both descriptions for inventory1 contain the text "sales number decrease", and both descriptions for inventory2 contain "sales number increased". The desired output is a new "duplicate" column that flags these records as "Y".
Is there a partial string matching approach, or any other approach, that can achieve this result? Note that the data below is a fake, simplified dataset to illustrate the idea. In my real dataset I have over a thousand rows, the matching text is sometimes in the middle of a sentence, and it varies between upper and lower case as well as between present and past tense. I want to explore some ideas. Thanks.
Below is the code:
import pandas as pd
df1 = {
    'item': ['item1', 'item2', 'item3', 'item4', 'item5', 'item6'],
    'name': ['inventory1', 'inventory1', 'inventory2', 'inventory2', 'inventory3', 'inventory3'],
    'code': [1, 1, 2, 2, 3, 3],
    'description': [
        'sales number decrease compared to last month', 'Sales number decreased',
        'sales number increased', 'Sales number increased, need to keep kpi',
        'no sales this month', 'item out of stock',
    ],
}
df1 = pd.DataFrame(df1)
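One idea to explore, as a rough sketch and not a definitive answer: normalize each description (lowercase, drop punctuation, crudely strip endings such as "ed" or "s" so that "decrease" and "decreased" compare equal), then flag two rows with the same name as duplicates when they share a run of three consecutive words. The helper names (`normalize`, `shingles`, `flag_partial_duplicates`), the suffix list, and the three-word threshold are all assumptions made for illustration. Real data would likely need a proper stemmer and some tuning.

```python
import re
from collections import defaultdict
from itertools import combinations

# Crude suffix stripping (an assumption, not a real stemmer).
SUFFIX = re.compile(r"(ing|ed|es|e|s)$")

def normalize(text):
    """Lowercase, drop punctuation, and strip common endings from longer words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [SUFFIX.sub("", t) if len(t) > 3 else t for t in tokens]

def shingles(text, n=3):
    """Return the set of runs of n consecutive normalized words."""
    tokens = normalize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_partial_duplicates(names, descriptions, n=3):
    """Flag rows that share an n-word run with another row of the same name."""
    names = list(names)
    descriptions = list(descriptions)
    flags = ["N"] * len(names)

    # Collect row positions per name so only rows of the same name are compared.
    groups = defaultdict(list)
    for pos, name in enumerate(names):
        groups[name].append(pos)

    for positions in groups.values():
        sets = {pos: shingles(descriptions[pos], n) for pos in positions}
        for a, b in combinations(positions, 2):
            if sets[a] & sets[b]:  # at least one shared n-word run
                flags[a] = flags[b] = "Y"
    return flags

names = ['inventory1', 'inventory1', 'inventory2',
         'inventory2', 'inventory3', 'inventory3']
descriptions = [
    'sales number decrease compared to last month', 'Sales number decreased',
    'sales number increased', 'Sales number increased, need to keep kpi',
    'no sales this month', 'item out of stock',
]
print(flag_partial_duplicates(names, descriptions))
# ['Y', 'Y', 'Y', 'Y', 'N', 'N']
```

Because the function only iterates over its inputs, it should also accept pandas columns, e.g. df1['duplicate'] = flag_partial_duplicates(df1['name'], df1['description']). The pairwise comparison is quadratic within each name group, which is fine for a few thousand rows but may need rethinking for very large groups.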
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow

