How do I match variations of a pandas string based on a list?

I have a pandas dataframe with one column containing country names and I'd like to flag them if they appear in a list of countries I have. However, some of the countries have been entered incorrectly.

E.g., 'Republic of Panama' has been entered as just 'Panama', 'Panama city', or 'Panama state'. The list I am checking against only contains 'Republic of Panama', so none of these variations are captured. This applies to many other countries too, like 'St Lucia', which should be 'Saint Lucia'. I am considering matching each string in the pandas column against any word of a name in my list.

Eg.

list = ['Republic of Panama', 'Hong Kong', 'Chile']

All variations of Panama should match in my pandas column, because the word 'Panama' is present. Can anyone suggest how to split each string in the list into separate words, so that a match of this kind is possible?

Thanks



Solution 1:[1]

If we take your column of countries to be "country" in a dataframe that looks something like this:

import pandas as pd

df = pd.DataFrame({"country": ["Republic of Panama", "Panama City", "Panama", "St Lucia", "Saint Lucia", "Chile", "Poland", "Germany", "Hong Kong"]})
df
#              country
#0  Republic of Panama
#1         Panama City
#2              Panama
#3            St Lucia
#4         Saint Lucia
#5               Chile
#6              Poland
#7             Germany
#8           Hong Kong

If your list of countries is then ["Panama", "Hong Kong", "Lucia"] (you should not name a list list, since that shadows the built-in type), you can use .str.contains(), which accepts a regular expression:

l = ["Panama", "Hong Kong", "Lucia"]
l2 = "|".join(l)
l2
#'Panama|Hong Kong|Lucia'

In a regular expression, | means "or", so the pattern will match any string containing "Panama" or "Hong Kong" or "Lucia".

df[df["country"].str.contains(l2)]
#              country
#0  Republic of Panama
#1         Panama City
#2              Panama
#3            St Lucia
#4         Saint Lucia
#8           Hong Kong
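
If you would rather not pick the keywords by hand, the following is a minimal sketch of deriving them from the full names in the question's list and storing the result as a flag column instead of filtering. The STOP_WORDS set, the build_pattern helper and the in_list column name are my own assumptions for illustration, not part of pandas; you would need to tune the stop words for your data. It also escapes each keyword with re.escape, adds word boundaries, and matches case-insensitively.

```python
import re

import pandas as pd

df = pd.DataFrame({"country": ["Republic of Panama", "Panama City", "panama",
                               "St Lucia", "Chile", "Poland", "Hong Kong"]})

# Full names as given in the question's reference list.
countries = ["Republic of Panama", "Hong Kong", "Chile"]

# Hypothetical stop words: generic words that should not count as a match alone.
STOP_WORDS = {"republic", "of", "the", "saint", "st"}


def build_pattern(names, stop_words=STOP_WORDS):
    """Split each name into words, drop stop words, and join the rest with '|'."""
    keywords = set()
    for name in names:
        for word in name.split():
            if word.lower() not in stop_words:
                # Escape so characters like '.' or '(' are matched literally.
                keywords.add(re.escape(word))
    # Non-capturing group with word boundaries, so "Chile" does not match "Chilean".
    return r"\b(?:" + "|".join(sorted(keywords)) + r")\b"


pattern = build_pattern(countries)

# case=False ignores capitalisation; na=False treats missing values as "no match".
df["in_list"] = df["country"].str.contains(pattern, case=False, regex=True, na=False)
```

Be aware that splitting multi-word names has a cost: "Hong Kong" becomes the separate keywords "Hong" and "Kong", so an unrelated entry containing either word would also be flagged. If that matters, keep such names whole in the list of keywords instead of splitting them.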

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rawson