'Python: Removing value from cell if it's not always the successor of other column-values

I've been fighting with a problem for days now. I have a list of 15k article names and separated the first two words of the article name to be able to identify the brand. My logic here is that for every fabricator, check if the second word always follows the same first word. If this is true, this indicates that the split is the full brand name and everything is ok. If not, the second word shall be removed, as it indicates that not only the brand name, but part of the product name is also included. Checking the fabricator is important as 'split1' might be the same for different fabricators, indicating a different brand.

TL;DR: Set 'split2' to NaN if it not always the succesor of 'split1' & 'fabricator'.

This is a simplified version of my dataframe:

d = {'fabricator':['coca cola corp.','coca cola corp.','coca cola corp.','haribo ltd','haribo ltd','haribo ltd'],'product name': ['coca cola light', 'coca cola zero', 'fanta', 'haribo gold bears', 'haribo gold bears soft','haribo berries'], 'split1': ['coca', 'coca','fanta', 'haribo', 'haribo','haribo'], 'split2': ['cola', 'cola',np.nan, 'gold','gold', 'berries']}
df = pd.DataFrame(data=d)
print(df)
        fabricator            product name  split1   split2
0  coca cola corp.         coca cola light    coca     cola
1  coca cola corp.          coca cola zero    coca     cola
2  coca cola corp.                   fanta   fanta      NaN
3       haribo ltd       haribo gold bears  haribo     gold
4       haribo ltd  haribo gold bears soft  haribo     gold
5       haribo ltd          haribo berries  haribo  berries

This is how it shall look like in the end:

        fabricator            product name  split1 split2
0  coca cola corp.         coca cola light    coca   cola
1  coca cola corp.          coca cola zero    coca   cola
2  coca cola corp.                   fanta   fanta    NaN
3       haribo ltd       haribo gold bears  haribo    NaN
4       haribo ltd  haribo gold bears soft  haribo    NaN
5       haribo ltd          haribo berries  haribo    NaN 

This code works as expected for a small df, but is extremely slow for a large dataset & not "pythonesque":

for h in df['fabricator']:
     for s1 in df['split1']:
         if df.loc[(df['fabricator] == f) & (df['split1'] == s1), 'split2'].nunique() > 1:
             df.loc[(df['fabricator] == f) & (df[split1'] == s1), 'split2'] = str('')

I tried combining groupby(['fabricator','split1']) with .duplicated() and .transform() with no success. Any help would be much appreciated! Thank you!



Solution 1:[1]

This should do the trick (it has been substantially edited from when I first posted it):

pair_dict = {}
nanmapset = set()
for split1, split2 in df[['split1', 'split2']].values:
    if split1 not in pair_dict:
        pair_dict[split1] = split2
    elif split2 != pair_dict[split1]:
        nanmapset.add((split1, split2))
        nanmapset.add((split1, pair_dict[split1]))
df['split2'] = df.apply(lambda row: row['split2'] if (row['split1'], row['split2']) not in nanmapset else np.nan, axis=1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1