'Count unique groups within a pandas data frame

I have a data frame of patent numbers and the inventors who invented those patents. For example:

patent_number	inventor_id
1	A
1	B
2	B
2	C
3	A
3	B

I define a team as a group of inventors who produce a patent together. E.g. the team (A,B) produced patent 1, (B,C) patent 2 and again (A,B) produced patent 3. I want to count the number of unique teams. In this case the answer is 2.

What is the fastest way of counting the number of unique teams using python?

I have written this code, but it is very slow when I run it on my entire data set which includes over 6 million patent numbers and 3.5 million unique inventor ids.

teams = []

for pat_id, pat_df in inventor_data.groupby("patent_number"):

    if list(pat_df["inventor_id"]) not in teams:
    
        teams.append(list(pat_df["inventor_id"]))

print("Number of teams ", len(teams))

I am looking for speed improvements. If you can help me with understand the reasons why they are faster I am always keen to learn about this.

Thank you!

Solution 1:^[1]

You could go for:

   inventor_data = inventor_data.sort_values("inventor_id")
   inventor_data.groupby("patent_number").inventor_id.sum().nunique()

A few explanations:

Sorting the values is mandatory to avoid symmetries, and consider (A,B) and (B,A) as a single team.
You can sum the strings "A" and "B" to produce a string "AB" representing the team (A, B)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Grégoire

'Count unique groups within a pandas data frame

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]