Read large binary file in chunks using numpy.fromfile count attribute
I have a large binary file (9 GB) that I need to read in chunks and save as CSV (perhaps split across multiple CSV files) for later processing. The code below reads the entire binary file at once. However, the file is too large for that, so I need a while or for loop that reads its contents in chunks. I found in the documentation that numpy.fromfile has a count parameter that sets the number of items to read. With count=100000, though, it reads only the first 100k rows and stops. Assuming the whole file has 1M rows, I would like my code to read it in 10 chunks, continuing until the file is exhausted, and produce either one final CSV file or 10 separate CSV files.
Here is my code to read the entire binary file.
import numpy as np
import pandas as pd
dt = np.dtype([('Time', 'I'), ('Z', 'H'), ('Y', 'H'), ('X', 'H')])
data = np.fromfile('MyFile.bin', dtype=dt, sep='')
data_df = pd.DataFrame(data)
#SOME MORE DATA PROCESSING#
data_df.to_csv(r'Output\FinalOutput.csv')  # raw string keeps the backslash literal
I am converting it into a DataFrame because I need to do some more data processing.
Solution 1:[1]
You can use the offset parameter of numpy.fromfile (available since NumPy 1.17).
Here is sample code that reads a binary file with an offset:
import numpy as np
x = np.random.rand(10000)
x.astype('float64').tofile("x.bin")
y = np.fromfile("x.bin", count=100, offset=0)
np.testing.assert_equal(x[:100], y)
y = np.fromfile("x.bin", count=100, offset=800)
np.testing.assert_equal(x[100:200], y)
The offset parameter takes the byte position at which to start reading. Because I saved the values as float64 (8 bytes per element), skipping the first 100 elements requires an offset of 800 bytes. Knowing the layout of your data, you can calculate the number of bytes to use as the offset in the same way.
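Applied to the structured dtype from the question, the same idea can be turned into a loop. The following is only a minimal sketch: the helper names iter_chunks and binary_to_csv_parts and the out_pattern argument are made up for illustration, and it assumes the file contains nothing but back-to-back records of that dtype (10 bytes each: one 4-byte unsigned int and three 2-byte unsigned shorts, with no padding).

```python
import os

import numpy as np
import pandas as pd

# Record layout copied from the question; itemsize is 10 bytes (4 + 2 + 2 + 2).
dt = np.dtype([('Time', 'I'), ('Z', 'H'), ('Y', 'H'), ('X', 'H')])


def iter_chunks(path, dtype, rows_per_chunk):
    """Yield structured arrays of at most rows_per_chunk records (hypothetical helper)."""
    # Assumption: the file holds only whole records of this dtype.
    total_rows = os.path.getsize(path) // dtype.itemsize
    for start in range(0, total_rows, rows_per_chunk):
        count = min(rows_per_chunk, total_rows - start)
        # offset is given in bytes, so convert the row index to a byte position.
        yield np.fromfile(path, dtype=dtype, count=count,
                          offset=start * dtype.itemsize)


def binary_to_csv_parts(path, dtype, rows_per_chunk, out_pattern):
    """Write one CSV per chunk and return the list of file names (hypothetical helper)."""
    written = []
    for i, chunk in enumerate(iter_chunks(path, dtype, rows_per_chunk)):
        df = pd.DataFrame(chunk)
        # Any per-chunk processing would go here.
        out_name = out_pattern.format(i)
        df.to_csv(out_name, index=False)
        written.append(out_name)
    return written
```

A call such as binary_to_csv_parts('MyFile.bin', dt, 100000, 'part_{}.csv') would then write part_0.csv, part_1.csv and so on. If a single CSV is preferred, each chunk could instead be appended to one file with to_csv(..., mode='a'), writing the header only for the first chunk.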
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | tia.milani |
