Python and spreadsheet: given the user input, expand the list of results using regex

I'm working with Python on a movie database stored in Google Sheets. I want to build a query that searches for names and surnames across three columns. I ask the user to enter a name, and the output should be the matching rows with a selection of columns. I would like the user to be able to enter just a name, a surname, or even just a few letters, case-insensitively. I would also like the results to match "foreign characters" like ü ø å ... even if the user types the most similar plain character: input "uber", output "über". Thank you!

My approach is to create a pattern and match whatever comes before and after my input. For instance, "russel" should match Jay Russell and Chuck Russell, and "cuaron" should match Alfonso Cuarón.

import pandas as pd
import re

df = pd.read_csv(gsheet_url) #df is the whole spreadsheet

def get_actor():
    request_actor = input("Enter an actor: ")
    request_actor = request_actor.lower().title()
    if request_actor in df.values:
        mask1 = df['Actor1'].str.contains(request_actor)
        mask2 = df['Actor2'].str.contains(request_actor)
        mask3 = df['Actor3'].str.contains(request_actor)
        actor_data = df.loc[mask1 | mask2 | mask3, ['Title', 'Year', 'Genres', 'Director']]
        print('All the movies of the actor you were looking for\n', actor_data, '\n')
        print('Do you want to do a new search or find data?')
        welcome()

    else:
        print('The actor is not present in the database')
        print('Do you want to do a new search or find data?')
        welcome()


Solution 1:[1]

I parsed the dataframe twice. After the first pass, if the input was not found among the df values, I applied the .str.normalize(), .str.encode() and .str.decode() methods to the actor columns of a copy of the df.

import pandas as pd
def get_actor():
    request_actor = input("Type an actor: ")
    request_actor = request_actor.lower().title().strip()
    search = False
    df_copy = df.copy(deep=True)
    for value in df_copy[['Actor1', 'Actor2', 'Actor3']].values:
        for item in value:
            if request_actor in str(item):
                search = True

    if not search:
        df_copy['Actor1'] = (
            df_copy['Actor1'].str.normalize('NFKD').str.encode(
                'ascii', errors='ignore').str.decode('utf-8'))
        df_copy['Actor2'] = (
            df_copy['Actor2'].str.normalize('NFKD').str.encode(
                'ascii', errors='ignore').str.decode('utf-8'))
        df_copy['Actor3'] = (
            df_copy['Actor3'].str.normalize('NFKD').str.encode(
                'ascii', errors='ignore').str.decode('utf-8'))
        for value in df_copy[['Actor1', 'Actor2', 'Actor3']].values:
            for item in value:
                if request_actor in str(item):
                    search = True

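    if search:
        # na=False treats empty cells as non-matches so the combined mask stays boolean;
        # regex=False matches the input literally, like the `in` checks above
        mask1 = df_copy['Actor1'].str.contains(request_actor, na=False, regex=False)
        mask2 = df_copy['Actor2'].str.contains(request_actor, na=False, regex=False)
        mask3 = df_copy['Actor3'].str.contains(request_actor, na=False, regex=False)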
        actor_data = df_copy.loc[mask1 | mask2 | mask3, [
            'Title', 'Genres', 'Director', 'Actor1', 'Actor2', 'Actor3',
            'IMDB Score']]
        print(actor_data)
    else:
        print("This actor was not present in the database")
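
The same idea can probably be condensed by folding both the query and the actor columns to a lowercase ASCII form before comparing, so that a single pass handles case, accents and partial names. What follows is a minimal sketch rather than the answerer's code: the helper names `fold` and `search_actor` and the small sample frame are hypothetical, and only the column names are taken from the question.

```python
import pandas as pd

ACTOR_COLUMNS = ['Actor1', 'Actor2', 'Actor3']  # column names from the question


def fold(series):
    """Lowercase a Series of strings and drop combining accents (NFKD + ASCII)."""
    return (series.fillna('')
                  .astype(str)
                  .str.normalize('NFKD')
                  .str.encode('ascii', errors='ignore')
                  .str.decode('ascii')
                  .str.lower())


def search_actor(df, query):
    """Return the rows of df where any actor column contains the folded query."""
    needle = fold(pd.Series([query.strip()])).iloc[0]
    if not needle:
        # An empty query would match every row, so return no rows instead
        return df.iloc[0:0]
    mask = pd.Series(False, index=df.index)
    for column in ACTOR_COLUMNS:
        # regex=False: treat the user input as literal text, not as a pattern
        mask |= fold(df[column]).str.contains(needle, regex=False)
    return df.loc[mask]


# Hypothetical sample data, only to exercise the function
movies = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Actor1': ['Alfonso Cuarón', 'Chuck Russell', 'Jay Russell'],
    'Actor2': [None, 'Someone Else', 'Another Name'],
    'Actor3': [None, None, None],
})

print(search_actor(movies, 'cuaron'))
print(search_actor(movies, 'russel'))
```

Because the mask is built from folded copies of the columns while the rows are selected from the original frame, the printed names keep their accents. Letters with no Unicode decomposition, such as ø, are dropped by the ASCII encoding step rather than mapped to a similar letter.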

Solution 2:[2]

Most importantly, I would like my results to match "foreign characters" like ü ø å ... regardless of what the user types (they might not have those characters readily available, or might not be sure about the spelling).

Regarding diacritics, you might use unicodedata (it is part of the standard library) to remove them as follows:

import unicodedata
def remove_diacritics(text):
    return ''.join(i for i in unicodedata.normalize('NFD',text) if i.isascii())
print(remove_diacritics("Cuarón"))

output

Cuaron
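
One caveat: NFD only splits characters that have a canonical decomposition. A letter such as ø is a single code point without one, so the `isascii` filter drops it entirely instead of turning it into `o`. A possible extension is sketched below, assuming a small hand-written replacement table is acceptable; the table `EXTRA_REPLACEMENTS` and the function name `fold_text` are hypothetical and not part of `unicodedata`.

```python
import unicodedata

# Hypothetical table for letters that have no Unicode decomposition;
# extend it with whatever characters appear in the data
EXTRA_REPLACEMENTS = str.maketrans({
    'ø': 'o', 'Ø': 'O',
    'ł': 'l', 'Ł': 'L',
    'đ': 'd', 'Đ': 'D',
    'æ': 'ae', 'Æ': 'AE',
})


def fold_text(text):
    """Return a case-folded, accent-free version of text for comparisons."""
    text = text.translate(EXTRA_REPLACEMENTS)
    decomposed = unicodedata.normalize('NFD', text)
    # Drop only the combining marks; other non-ASCII letters are kept
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold_text('Cuarón'))
print(fold_text('Ødegård'))
print('uber' in fold_text('Über'))
```

Applying `fold_text` to both the user input and the stored names before comparing them makes the match insensitive to case and diacritics.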

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 cla.cif
Solution 2 Daweo