'Transforming a data frame column into a counter matrix
I have a data frame like this. Invoices are transformed into a table where each row has a single product. This table consists of millions of rows -
| invoice | products |
|---|---|
| INVC1 | xx |
| INVC1 | yy |
| INVC1 | zz |
| INVC1 | aa |
| INVC2 | xx |
| INVC2 | aa |
| INVC2 | bb |
| INVC2 | cc |
Now I want to create a table like the one below where I can see how many times each product is purchased with other products-
| xx | yy | zz | aa | bb | cc | |
|---|---|---|---|---|---|---|
| xx | 0 | 1 | 1 | 2 | 1 | 1 |
| yy | 1 | 0 | 1 | 1 | 0 | 0 |
| zz | 1 | 1 | 0 | 1 | 0 | 0 |
| aa | 2 | 1 | 1 | 0 | 1 | 1 |
| bb | 1 | 0 | 0 | 1 | 0 | 1 |
| cc | 1 | 0 | 0 | 1 | 1 | 0 |
Is there any numpy/pandas way to create a table like this? Or, is there any optimized way to do this.
Solution 1:[1]
Let us use pd.crosstab then dot
s = pd.crosstab(df['invoice'],df['products'])
out = s.T.dot(s)
Out[196]:
products aa bb cc xx yy zz
products
aa 2 1 1 2 1 1
bb 1 1 1 1 0 0
cc 1 1 1 1 0 0
xx 2 1 1 2 1 1
yy 1 0 0 1 1 1
zz 1 0 0 1 1 1
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BENY |
