Check for a list of words in a pandas DataFrame, skipping those words that are not in

I'm trying to filter a DataFrame using a list of words. The problem is that some of the words may not appear in a given row, yet that row could still be useful.

This DataFrame is a catalog that I'm getting from a web scraping process. Each row is a different, unique product.

The list that I'm using comes from another process and may contain words that are not useful because they do not appear in the string, and I can modify it.

For example, suppose we have the following DataFrame:

import pandas as pd

mycolumn = ['Products']
products = ['Kitadol 500 mg x 24 Comprimidos',
            'Paracetamol 500 mg',
            'Prestat 75 mg x 40 Comprimidos',
            'Pedialyte 60 Manzana x 500 mL Solución Oral',
            'Panadol Niños 100mg/Ml Gotas 15ml']
df = pd.DataFrame(products, columns=mycolumn)

And I have the following list of words:

list_words = ['PARACETAMOL','KITADOL','500','MG','LIB']

In my dataset of products, 'Kitadol 500 mg x 24 Comprimidos' is a product where "Kitadol" is the commercial name, and the molecule, "Paracetamol", is not in the description. My main problem is that if I ask for all products that contain "paracetamol", 'Kitadol 500 mg x 24 Comprimidos' will not be in my search results.

My list of words contains keywords from a dictionary. For example, if I search my dictionary for "Paracetamol" and "500", I get these keywords (there could be more).

My goal is to get from my dataset all the products that contain these words, skipping the words that a product does not contain. For example, the product with the description 'Paracetamol 500 mg' doesn't contain "Kitadol", so there will be no full match, only a match on the keywords "Paracetamol", "500" and "MG". Right now I don't know how to show all the products that contain at least some of these keywords while ignoring the keywords that are not present.

My final table needs to contain two products:

  • Kitadol 500 mg x 24 Comprimidos
  • Paracetamol 500 mg

I wonder if someone knows how to deal with this or can give me some ideas. Kind regards and thanks!



Solution 1:[1]

If it is possible to specify exactly what is needed for each match (here, all 3 values of a tuple must match), use Series.str.findall with re.I to ignore case, and in a list comprehension test whether the number of unique matched values is the same as the length of each tuple:

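import re
import numpy as np

tups = [('PARACETAMOL','500','MG'), ('KITADOL','500','MG')]

# one boolean mask per tuple: True where every word of the tuple was found
L = [df['Products'].str.findall("|".join(i),flags=re.I).apply(lambda x: len(set(x)))==len(i)
     for i in tups]

# keep rows that satisfy at least one tuple
df = df[np.logical_or.reduce(L)]
print (df)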
                          Products
0  Kitadol 500 mg x 24 Comprimidos
1               Paracetamol 500 mg
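
One caveat with the findall approach: the alternation also matches substrings (for example 'mg' inside '100mg'), and the matched text keeps its original case, so 'mg' and 'MG' would count as two different values. Below is a minimal sketch of a stricter variant, assuming whole-word matches are what is wanted. The helper name contains_all is hypothetical and not part of the original answer.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'Products': [
    'Kitadol 500 mg x 24 Comprimidos',
    'Paracetamol 500 mg',
    'Prestat 75 mg x 40 Comprimidos',
    'Pedialyte 60 Manzana x 500 mL Solución Oral',
    'Panadol Niños 100mg/Ml Gotas 15ml',
]})

tups = [('PARACETAMOL', '500', 'MG'), ('KITADOL', '500', 'MG')]


def contains_all(series, words):
    """Hypothetical helper: True where every word occurs as a whole word."""
    # escape each word and wrap the alternation in word boundaries
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    found = series.str.findall(pattern, flags=re.I)
    wanted = {w.upper() for w in words}
    # normalise case before comparing, so 'mg' and 'MG' count once
    return found.apply(lambda hits: {h.upper() for h in hits} == wanted)


# one mask per tuple, then keep rows that satisfy at least one tuple
masks = [contains_all(df['Products'], t) for t in tups]
result = df[np.logical_or.reduce(masks)]
print(result)
```

With the sample data this keeps the same two rows as above; the difference only shows up on descriptions that repeat a keyword in a different case or embed it in a longer token.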
    

Solution 2:[2]

Your question is not clear, so I am making the following assumptions. Going by your expected output, you want to find medications by name, and dosage terms (numbers and units such as MG) have to be excluded. My attempt is below:

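# keep only alphabetic keywords that are not units
s=[x for x in list_words if x.isalpha() and x not in ['MG','ML','X']]
# keep rows whose upper-cased tokens share at least one word with s
df[df['Products'].str.upper().str.split(r'\s').map(set).apply(lambda x: len(x.intersection(set(s))))>0]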

     

                          Products
0  Kitadol 500 mg x 24 Comprimidos
1               Paracetamol 500 mg
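
To also see which keywords each product actually matched (and so which ones were skipped), the same token-set idea can be extended. This is only a sketch under the same assumptions as above: the units set and the column names matched and has_name are made up for illustration, and a product is kept only if it matches at least one name-like keyword.

```python
import pandas as pd

df = pd.DataFrame({'Products': [
    'Kitadol 500 mg x 24 Comprimidos',
    'Paracetamol 500 mg',
    'Prestat 75 mg x 40 Comprimidos',
    'Pedialyte 60 Manzana x 500 mL Solución Oral',
    'Panadol Niños 100mg/Ml Gotas 15ml',
]})

list_words = ['PARACETAMOL', 'KITADOL', '500', 'MG', 'LIB']

# assumption: these tokens are units/separators, not product names
units = {'MG', 'ML', 'X'}
keywords = set(list_words)
names = {w for w in list_words if w.isalpha() and w not in units}

# upper-case each description and split it into a set of tokens
tokens = df['Products'].str.upper().str.split().map(set)

out = df.assign(
    # keywords present in the row; absent keywords are simply skipped
    matched=tokens.map(lambda t: sorted(t & keywords)),
    # True if at least one name-like keyword is present
    has_name=tokens.map(lambda t: bool(t & names)),
)

result = out[out['has_name']]
print(result[['Products', 'matched']])
```

Keeping the matched column makes it easy to rank or inspect partial matches later, instead of only getting a yes/no filter.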

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jezrael
Solution 2 wwnde