Rolling append to Avro file with size threshold in Python
I have a bunch of objects that I'd like to export to Avro files. The amount of data can be arbitrarily large, so I'd like to write them into fixed-size parts, each smaller than a THRESHOLD. For example, if I have 13 GB of data that I'd like to write into 5 GB volumes, I'd like my output files to be:
/out/objects-001.avro size: 5GB
/out/objects-002.avro size: 5GB
/out/objects-003.avro size: 3GB
Now my problem is that when I use DataFileWriter to append the objects to the file, there's no method that returns the size of the file written so far in bytes, so I don't know when the current output file crosses the threshold size.
I was looking to write something along the lines of:
THRESHOLD = 500_000_000  # 500 MB per part
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
# ...
for obj in objects_iter:
    if writer.bytes_written_so_far() < THRESHOLD:  # no such method exists
        writer.append(obj)
    else:
        # close the current writer and open a new one
        ...
I'd like to mention that the size of the objects is not fixed, so I cannot approximate the size on disk by counting the objects.
Any idea how to know the file size without resorting to forking the Avro code and exposing this size?
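Since you open the file object yourself before handing it to DataFileWriter, one workaround is to keep a reference to that handle, flush the writer after each append, and check handle.tell() against the threshold. Below is a minimal stdlib-only sketch of just the rotation logic, using plain bytes instead of Avro records; the function name roll_write and the part-naming scheme are my own illustrative choices, not part of any library:

```python
import os
import tempfile

def roll_write(records, out_dir, threshold):
    """Write byte records into numbered part files, starting a new part
    once the current file reaches `threshold` bytes.

    With Avro you would keep the handle you pass to DataFileWriter,
    flush the writer, and check handle.tell() in the same way."""
    part = 0
    f = None
    paths = []
    for rec in records:
        if f is None:
            part += 1
            path = os.path.join(out_dir, f"objects-{part:03d}.bin")
            paths.append(path)
            f = open(path, "wb")
        f.write(rec)
        f.flush()                  # make tell() reflect bytes on disk
        if f.tell() >= threshold:  # size check via the handle we kept
            f.close()
            f = None
    if f is not None:
        f.close()
    return paths

# Demonstration with tiny records and a 100-byte threshold.
tmp = tempfile.mkdtemp()
paths = roll_write([b"x" * 40] * 7, tmp, threshold=100)
sizes = [os.path.getsize(p) for p in paths]
```

Note that tell() only reflects what the writer has actually flushed to the file, so with Avro the check is approximate: a part may overshoot the threshold by up to one buffered block plus one record.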
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
