'Using a lambda function to search for specific text then modify that text in the dataframe

junior dev here.

I'm trying to create an "onshore" and "offshore" tag for a report. I'm trying to use a lambda function to search for specific text in my data frame then override the cell in the data frame if conditions are met.

Dataset:

enter image description here

Desired Output:

enter image description here

Current Output:

enter image description here

Current Code:

df['Office_Address'] = df.apply(lambda x: "onshore" if ((x['Employment_Type'] == 'Consultant') and (x['Office_Address'] == str.contains('United States of America', regex = True))) else x['Office_Address'] == "offshore", axis = 1) 

I believe my issue is within the use of str.contains() which returns a boolean value, hence the current output. I'm not certain how to get the if statement to only trigger when it sees the partial text of "United States of America". My dataset varies in the placement of the United States of America tag but I know it's always spelled that way.



Solution 1:[1]

The reason you're seeing a True or False is because the final statements e.g.:

else x['Office_Address'] == "offshore"

is a boolean statement of whether that row value equals 'offshore' which evaluates to a boolean value. This can be updated to just return

"offshore"

So the full example would be:

df['Office_Address'] = df.apply(lambda x: "onshore" if ((x['Employment_Type'] == 'Consultant') and ('United States of America' in x['Office_Address'])) else "offshore", axis = 1) 

To improve your code even further you can take advantage of pandas' ability to operate on large amounts of data using column level operations. They are usually more efficient than something like pd.apply. (see https://realpython.com/fast-flexible-pandas/)

With that in mind, this example can do a direct column modification:

df.loc[(df['Employment_Type'] == 'Consultant') & (df['Office_Address'].str.contains('United States of America', regex = True)], 'Office_Address'] = "onshore" 
df.loc[!(df['Office_Address'] == 'onshore')], 'Office_Address'] = "offshore" 

The statement above uses .loc[rows, columns]. This can be thought of as set the values 'at the location of' these rows and these columns. In this case the rows to update are those which have Employment_Type equal to 'Consultant' and Office_Address containing the 'United States of America', and sets the Office_Address value for these rows to 'onshore'. Once it's done that, it sets the values for the remaining rows to 'offshore' by elimination.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1