How to group a pandas DataFrame by array intersection

Say I have a DataFrame like the one below:

  UUID             domains
0  asd   [foo.com, foo.ca]
1  jkl    [foo.ca, foo.fr]
2  xyz            [foo.fr]
3  iek  [bar.com, bar.org]
4  qkr           [bar.org]
5  kij          [buzz.net]

How can I turn it into something like this?

  UUID
0  [asd, jkl, xyz]
1  [iek, qkr]
2  [kij]

I want to group all the UUIDs whose domains lists share at least one domain with another row's list, either directly or through a chain of rows. For example, rows 0 and 1 both contain foo.ca and rows 1 and 2 both contain foo.fr, so all three should be grouped together.

My data set has millions of rows, so I can't brute-force it.



Solution 1:[1]

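We can explode the domains column first, so that each row holds a single (UUID, domain) pair, then build a graph with networkx and take its connected components: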

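import networkx as nx
import pandas as pd
s = df.explode('domains')  # one row per (UUID, domain) pair
G = nx.from_pandas_edgelist(s, 'UUID', 'domains')  # each UUID is linked to its domains
domains = set(s['domains'])  # build the lookup once instead of once per node
# each connected component mixes UUIDs and domains; keep only the UUIDs
out = pd.Series([[y for y in x if y not in domains] for x in nx.connected_components(G)])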
Out[209]: 
0    [xyz, jkl, asd]
1         [iek, qkr]
2              [kij]
dtype: object
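
If adding networkx as a dependency is not an option, the same connected-components idea can be sketched with a plain union-find (disjoint-set) structure. This is not part of the original answer, only an illustration: the function name group_uuids and its helpers are hypothetical, and it assumes that UUID values are unique and that every domains entry is a list.

```python
import pandas as pd


def group_uuids(df):
    """Group UUIDs that share a domain, directly or through other rows.

    Hypothetical helper: assumes unique UUIDs and list-valued 'domains'.
    """
    parent = {}  # union-find forest: node -> parent node

    def find(x):
        # Walk up to the root, then point every visited node at it
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_owner = {}  # domain -> first UUID seen with that domain
    for uuid, domains in zip(df["UUID"], df["domains"]):
        parent.setdefault(uuid, uuid)
        for d in domains:
            if d in first_owner:
                # Domain already seen: merge this UUID with its first owner
                union(first_owner[d], uuid)
            else:
                first_owner[d] = uuid

    # Collect UUIDs under their root, preserving first-seen order
    groups = {}
    for uuid in df["UUID"]:
        groups.setdefault(find(uuid), []).append(uuid)
    return pd.Series(list(groups.values()))
```

Calling group_uuids(df) on the example frame should give the same three groups as above. The loop touches each (UUID, domain) pair once, so the work grows roughly linearly with the total number of domain entries rather than with the number of row pairs.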

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 BENY