Is there a way to identify duplicated rows based on partial text/string?

I am trying to flag duplicated items based on a partial text/string match. For example, in the dataset below, both descriptions for inventory1 contain the text "Sales number decrease", while both descriptions for inventory2 contain the text "sales number increased". The desired output is a new "duplicate" column that flags these records as "Y".

Is there a partial string matching approach, or any other approach, that can achieve this result? Note that the dataset below is fake and simplified to present my idea. In my real dataset, I have over a thousand rows, and some of the matching text appears in the middle of a sentence, with differences in upper vs. lower case and in present vs. past tense. I want to explore some ideas. Thanks.

Below is the code:

import pandas as pd

# Fake, simplified data: each inventory name has two items.
df1 = {'item': ['item1', 'item2', 'item3', 'item4', 'item5', 'item6'],
       'name': ['inventory1', 'inventory1', 'inventory2', 'inventory2', 'inventory3', 'inventory3'],
       'code': [1, 1, 2, 2, 3, 3],
       'description': ['sales number decrease compared to last month',
                       'Sales number decreased',
                       'sales number increased',
                       'Sales number increased, need to keep kpi',
                       'no sales this month',
                       'item out of stock']}

df1 = pd.DataFrame(df1)

The desired output adds a "duplicate" column in which item1 through item4 (the inventory1 and inventory2 rows) are flagged as "Y".
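One idea to explore, as a rough sketch and not a definitive answer: normalize each description (lowercase, strip punctuation, crudely trim word endings so "decrease" and "decreased" compare equal), then compare descriptions that share the same name and flag pairs whose longest shared run of words reaches a threshold. The helper names (crude_stem, normalize, shared_run, flag_partial_duplicates), the grouping by the name column, the threshold of 3 words, and the "N" value for unflagged rows are all assumptions made for illustration; the suffix trimming is a simple heuristic, not a real stemmer.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd


def crude_stem(word):
    # Heuristic only: trim a common ending so tense variants line up.
    # Keeps at least 3 characters so short words such as "need" survive.
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    # Lowercase, keep only alphanumeric tokens, then trim endings.
    return [crude_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


def shared_run(a, b):
    # Length of the longest contiguous run of tokens common to both lists.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def flag_partial_duplicates(df, group_col="name", text_col="description",
                            min_shared=3):
    # Compare every pair of rows inside each group (assumed: same name).
    tokens = df[text_col].map(normalize)
    flags = pd.Series("N", index=df.index)
    for labels in df.groupby(group_col).groups.values():
        for i, j in combinations(labels, 2):
            if shared_run(tokens.loc[i], tokens.loc[j]) >= min_shared:
                flags.loc[[i, j]] = "Y"
    return flags


df1 = pd.DataFrame({
    "item": ["item1", "item2", "item3", "item4", "item5", "item6"],
    "name": ["inventory1", "inventory1", "inventory2", "inventory2",
             "inventory3", "inventory3"],
    "code": [1, 1, 2, 2, 3, 3],
    "description": ["sales number decrease compared to last month",
                    "Sales number decreased",
                    "sales number increased",
                    "Sales number increased, need to keep kpi",
                    "no sales this month",
                    "item out of stock"],
})

df1["duplicate"] = flag_partial_duplicates(df1)
print(df1[["item", "name", "duplicate"]])
```

The pairwise comparison is quadratic within each group, which should be fine for a thousand rows with small groups. The threshold would need tuning on real data: a lower value catches shorter shared phrases but also flags more accidental overlaps.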

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
