How to group a pandas DataFrame by array intersection
Say I have a DataFrame like the one below:
  UUID              domains
0  asd    [foo.com, foo.ca]
1  jkl     [foo.ca, foo.fr]
2  xyz             [foo.fr]
3  iek   [bar.com, bar.org]
4  qkr            [bar.org]
5  kij           [buzz.net]
How can I turn it into something like this?
              UUID
0  [asd, jkl, xyz]
1       [iek, qkr]
2            [kij]
I want to group together all UUIDs whose domains lists share at least one domain, either directly or through a chain of other rows. For example, rows 0 and 1 both contain foo.ca, and rows 1 and 2 both contain foo.fr, so all three should end up in the same group.
My data set has millions of rows, so I can't brute-force it.
Solution 1:[1]
We can explode the domains column first, then use networkx to find the connected components:
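import networkx as nx
import pandas as pd
# One row per (UUID, domain) pair; each pair becomes an edge of a bipartite graph.
s = df.explode('domains')
G = nx.from_pandas_edgelist(s, 'UUID', 'domains')
# Build the domain lookup once as a set; calling s.domains.tolist() per node is quadratic.
domains = set(s['domains'])
out = pd.Series([[y for y in x if y not in domains] for x in nx.connected_components(G)])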
Out[209]:
0 [xyz, jkl, asd]
1 [iek, qkr]
2 [kij]
dtype: object
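If networkx is not available, the same connected-components idea can be sketched with a plain union-find (disjoint-set) over the rows. This is not part of the original answer: the function name `group_by_shared_items` and its `(key, items)` input shape are made up for illustration, and it assumes every row's domains value is an iterable of hashable items.

```python
from collections import defaultdict


def group_by_shared_items(rows):
    """Group keys whose item lists overlap, directly or transitively.

    rows: iterable of (key, items) pairs, e.g. (UUID, list of domains).
    Returns a list of lists of keys, ordered by first appearance.
    (Hypothetical helper, not a pandas or networkx API.)
    """
    rows = list(rows)
    parent = list(range(len(rows)))  # parent[i] is the union-find parent of row i

    def find(i):
        # Walk up to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_row = {}  # item -> index of the first row that contained it
    for i, (_, items) in enumerate(rows):
        for item in items:
            j = first_row.setdefault(item, i)
            if j != i:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Merge the two groups, keeping the smaller index as root.
                    parent[max(ri, rj)] = min(ri, rj)

    groups = defaultdict(list)
    for i, (key, _) in enumerate(rows):
        groups[find(i)].append(key)
    return list(groups.values())


sample = [
    ("asd", ["foo.com", "foo.ca"]),
    ("jkl", ["foo.ca", "foo.fr"]),
    ("xyz", ["foo.fr"]),
    ("iek", ["bar.com", "bar.org"]),
    ("qkr", ["bar.org"]),
    ("kij", ["buzz.net"]),
]
result = group_by_shared_items(sample)
print(result)
```

With the DataFrame from the question, something like pd.Series(group_by_shared_items(zip(df['UUID'], df['domains']))) should give the same grouping. Each (row, domain) pair is visited once, so the work grows roughly linearly with the total number of domain entries.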
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BENY |
