Serialization of a pandas DataFrame
Is there a fast way to do serialization of a DataFrame?
I have a grid system which can run pandas analysis in parallel. In the end, I want to collect all the results (as a DataFrame) from each grid job and aggregate them into a giant DataFrame.
How can I save data frame in a binary format that can be loaded rapidly?
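The aggregation step described above can be sketched with `pd.concat`; the per-job DataFrames here are stand-ins for whatever each grid job actually returns:

```python
import pandas as pd

# Hypothetical per-job results coming back from the grid
partials = [pd.DataFrame({"job": i, "value": range(3)}) for i in range(4)]

# Collect all partial results into one giant DataFrame
combined = pd.concat(partials, ignore_index=True)
print(combined.shape)  # (12, 2)
```

`ignore_index=True` rebuilds a clean 0..n-1 index so rows from different jobs do not collide.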
Solution 1:[1]
DataFrame.to_msgpack is experimental and has some issues (e.g. with Unicode), but it is much faster than pickling. It serialized a DataFrame with 5 million rows that was taking 2-3 GB of memory in about 2 seconds, and the resulting file was about 750 MB. Loading is somewhat slower, but still much faster than unpickling.
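Note that `to_msgpack`/`read_msgpack` were deprecated and removed in pandas 1.0, so a roundtrip on a current pandas is easiest to show with `to_pickle`, one of the remaining built-in binary formats; this is a minimal sketch, not a claim about which format is fastest on your data:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": [x * 0.5 for x in range(1000)]})

# Write the frame to a binary file and read it back
path = os.path.join(tempfile.mkdtemp(), "frame.pkl")
df.to_pickle(path)
restored = pd.read_pickle(path)

assert restored.equals(df)
```

For larger-than-memory or cross-language use, `to_parquet` (requires pyarrow or fastparquet) is the usual modern replacement for msgpack.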
Solution 2:[2]
Have you timed the available IO functions? Binary is not automatically faster, and HDF5 should be quite fast to my knowledge.
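Timing the candidates on your own data is straightforward; a minimal sketch comparing CSV against pickle is below (an `df.to_hdf`/`pd.read_hdf` pair would slot into the same harness, but it needs PyTables installed, so it is omitted here). Sizes and timings will of course vary by machine and data shape:

```python
import os
import tempfile
import time
import pandas as pd

df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})
tmp = tempfile.mkdtemp()

def timed(save, load, path):
    """Return (save_seconds, load_seconds) for one write/read roundtrip."""
    t0 = time.perf_counter()
    save(path)
    t_save = time.perf_counter() - t0
    t0 = time.perf_counter()
    load(path)
    return t_save, time.perf_counter() - t0

print("csv   :", timed(df.to_csv, pd.read_csv, os.path.join(tmp, "df.csv")))
print("pickle:", timed(df.to_pickle, pd.read_pickle, os.path.join(tmp, "df.pkl")))
```

Running each roundtrip a few times and taking the minimum gives more stable numbers than a single pass.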
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Felix D. |
| Solution 2 | Achim |
