Pandas memory error for large dataframe pivot
I am trying to perform a .pivot(index=x, columns=y, values=z).fillna(0) on a large pandas DataFrame (25,000,000 rows).
The expected pivot output would be about 1,500,000 x 100,000 in size.
The issue I'm facing is that pandas loads the whole DataFrame into memory, and for such a big df I do not have enough memory, so the operation ends in a MemoryError.
I have seen it suggested to perform operations in chunks, but in this case (constructing a pivot table) I do not think that is feasible.
Is there any alternative I could use to perform this operation (maybe writing to disk, or using another library)?
Solution 1:[1]
You can use set_index and unstack instead of pivot:
>>> import pandas as pd
>>> df = pd.DataFrame({'x': list('ABCABCABC'),
...                    'y': list('XXXYYYZZZ'),
...                    'z': range(1, 10)})
>>> df
x y z
0 A X 1
1 B X 2
2 C X 3
3 A Y 4
4 B Y 5
5 C Y 6
6 A Z 7
7 B Z 8
8 C Z 9
Using pivot:
>>> df.pivot(index='x', columns='y', values='z')
y X Y Z
x
A 1 4 7
B 2 5 8
C 3 6 9
Using set_index/unstack:
>>> df.set_index(['x', 'y'])['z'].unstack('y')
y X Y Z
x
A 1 4 7
B 2 5 8
C 3 6 9
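Note that set_index/unstack still materializes the full dense result, so at the sizes in the question (1,500,000 x 100,000 is about 1.5e11 cells, roughly 1.2 TB as float64) neither approach will fit in memory. Since most of those cells would be the zeros introduced by fillna(0), one alternative worth trying is to build the pivot as a sparse matrix instead. A minimal sketch, assuming df is the 25-million-row frame with the columns x, y, z from the question:

import pandas as pd
from scipy import sparse

# Map each distinct x and y label to an integer position
# without ever materializing the dense pivot.
row_codes, row_labels = pd.factorize(df['x'], sort=True)
col_codes, col_labels = pd.factorize(df['y'], sort=True)

# One stored entry per input row; absent (x, y) pairs remain
# implicit zeros, which plays the role of fillna(0).
mat = sparse.coo_matrix(
    (df['z'].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
).tocsr()

# Optionally keep the labels by wrapping the result in a
# sparse-backed DataFrame.
result = pd.DataFrame.sparse.from_spmatrix(
    mat, index=row_labels, columns=col_labels)

One caveat: pivot raises on duplicate (x, y) pairs, whereas coo_matrix silently sums duplicate entries when converted to CSR, so deduplicate the input first if that matters for your data.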
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Corralien |
