Pandas memory error for large dataframe pivot

I am trying to perform a .pivot(index=x, columns=y, values=z).fillna(0) on a large pandas DataFrame (25,000,000 rows).

The expected pivot output should be about 1,500,000 x 100,000 in size.

The issue I'm facing is that since pandas loads the whole DataFrame into memory, I do not have enough memory for such a big df, and the operation results in a MemoryError.
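To put a number on it, a dense result of that shape cannot fit in RAM anyway (shapes taken from above, float64 assumed):

>>> rows, cols = 1_500_000, 100_000
>>> rows * cols * 8 / 1e12  # cells x 8 bytes per float64, in terabytes
1.2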

I have seen it suggested to perform operations in chunks, but in this case (constructing a pivot table) I do not think that is feasible.

Is there any alternative way to perform this operation (maybe writing to disk, or using another library)?



Solution 1: [1]

You can use set_index and unstack instead of pivot:

>>> import pandas as pd
>>> df = pd.DataFrame({'x': list('ABCABCABC'),
...                    'y': list('XXXYYYZZZ'),
...                    'z': range(1, 10)})
>>> df
   x  y  z
0  A  X  1
1  B  X  2
2  C  X  3
3  A  Y  4
4  B  Y  5
5  C  Y  6
6  A  Z  7
7  B  Z  8
8  C  Z  9

Using pivot:

>>> df.pivot(index='x', columns='y', values='z')
y  X  Y  Z
x         
A  1  4  7
B  2  5  8
C  3  6  9

Using set_index/unstack:

>>> df.set_index(['x', 'y'])['z'].unstack('y')
y  X  Y  Z
x         
A  1  4  7
B  2  5  8
C  3  6  9
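Note that unstack also accepts a fill_value argument, so the separate .fillna(0) pass (and the extra intermediate copy it makes) can be skipped:

>>> df.set_index(['x', 'y'])['z'].unstack('y', fill_value=0)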

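If even the reshaped result is too large to materialize as a dense DataFrame, a different route (a sketch, not part of the original answer) is to build the pivot as a scipy.sparse matrix. It assumes the question's column names x/y/z: 25,000,000 stored values occupy on the order of a few hundred MB, while the dense 1,500,000 x 100,000 array would need terabytes, and missing (x, y) pairs are implicitly 0, which matches the .fillna(0):

import pandas as pd
from scipy import sparse

# df is assumed to hold the question's columns: x (rows), y (columns), z (values).
row_codes, row_labels = pd.factorize(df['x'])  # one integer code per unique x
col_codes, col_labels = pd.factorize(df['y'])  # one integer code per unique y

# Build in COO form, then convert to CSR for fast row slicing.
# Caveat: COO sums duplicate (x, y) pairs, whereas pivot would raise an error.
mat = sparse.coo_matrix(
    (df['z'].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
).tocsr()

row_labels and col_labels keep the mapping from matrix positions back to the original x and y values.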
Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1 by Corralien