Size of encoded avro message without encoding it

Is there a way to get the size of the encoded avro message without actually encoding it?

I'm using Avro 1.8.1 for C++.

I'm used to Google Protocol Buffers, where you can call ByteSize() on a message to get its encoded size, so I'm looking for something similar.

Since the message is in essence a raw struct, I understand that the size cannot be retrieved from the message itself, but perhaps there is a helper method that I'm not aware of?



Solution 1:[1]

(Edit below shows a hacky way to shrink-to-fit an OutputStream after writing to it with a BinaryEncoder)

It's a shame that avro::encode() doesn't use backup on the OutputStream to free unused memory after encoding. Martin G's answer gives the best solution using only the tools avro provides, but it issues N memory allocations of 1 byte each if your serialized object is N bytes in size.

You could implement a custom avro::OutputStream that simply counts and discards all written bytes. This would get rid of the memory allocations. It's still not a great approach, as the actual encoder will have to "ask" for every single byte:

(Code untested, just for demonstration purposes)

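#include <avro/Encoder.hh>
#include <avro/Stream.hh>
#include <cstddef>
#include <cstdint>

// Counts the bytes the encoder writes and discards them.
class ByteCountOutputStream : public avro::OutputStream {
public:
    size_t byteCount_ = 0;
    uint8_t dummyWriteLocation_ = 0;  // every byte lands here and is overwritten

    ByteCountOutputStream() = default;

    // Hand out a single byte at a time so the count is always exact.
    bool next(uint8_t **data, size_t *len) final {
        byteCount_ += 1;
        *data = &dummyWriteLocation_;
        *len = 1;
        return true;
    }

    // The encoder returns bytes it asked for but did not use.
    void backup(size_t len) final {
        byteCount_ -= len;
    }

    uint64_t byteCount() const final {
        return byteCount_;
    }

    void flush() final {}
};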

This could then be used as:

MyAvroStruct obj;

avro::EncoderPtr encoder = avro::binaryEncoder();
ByteCountOutputStream out;  // no parentheses: "out()" would declare a function
encoder->init(out);
avro::encode(*encoder, obj);
size_t bufferSize = out.byteCount();
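
If your schema only contains a handful of primitive fields, you can also add up the size by hand, because the Avro binary encoding is specified: an int or long is a zigzag-mapped variable-length integer, a string is its length (as a long) followed by the bytes, a double is 8 bytes, and a record is just the concatenation of its fields. A minimal sketch of that arithmetic follows. The helper names and the example record are made up for illustration and are not part of the Avro C++ API, and unions, arrays and maps (which carry extra framing) are not covered:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers, not part of Avro C++.

// Bytes used by Avro's binary encoding of an int/long:
// zigzag-map the value, then 7 payload bits per output byte.
inline std::size_t avroLongSize(std::int64_t value) {
    const std::uint64_t sign = value < 0 ? ~std::uint64_t{0} : std::uint64_t{0};
    std::uint64_t zigzag = (static_cast<std::uint64_t>(value) << 1) ^ sign;
    std::size_t bytes = 1;
    while (zigzag >= 0x80) {
        zigzag >>= 7;
        ++bytes;
    }
    assert(bytes <= 10);  // a 64-bit value never needs more than 10 bytes
    return bytes;
}

// A string (or bytes) value: length as a long, then the raw bytes.
inline std::size_t avroStringSize(const std::string &s) {
    return avroLongSize(static_cast<std::int64_t>(s.size())) + s.size();
}

// Fixed-width primitives.
constexpr std::size_t kAvroDoubleSize = 8;
constexpr std::size_t kAvroFloatSize = 4;
constexpr std::size_t kAvroBooleanSize = 1;

// Hypothetical record {long id; string name; double score}:
// fields are concatenated in declaration order, with no extra framing.
inline std::size_t exampleRecordSize(std::int64_t id, const std::string &name) {
    return avroLongSize(id) + avroStringSize(name) + kAvroDoubleSize;
}
```

This only pays off for small, stable schemas. For anything generated or nested, the counting stream above is less error-prone, since it follows whatever the encoder actually does.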

Edit: My initial question when stumbling upon this was: How can I tell how many bytes of the OutputStream are required (for storing / transmitting)? Or, equivalently, if OutputStream.byteCount() returns the count of bytes allocated by the encoder so far, how can I make the encoder "backup" / release the bytes it didn't use? Well, there is a hacky way:

The Encoder abstract class provides an init method. For the BinaryEncoder, this is currently implemented as:

void BinaryEncoder::init(OutputStream &os) {
    out_.reset(os);
}

with out_ being the internal StreamWriter of the Encoder.

Now, the StreamWriter implements reset as:

    void reset(OutputStream &os) {
        if (out_ != nullptr && end_ != next_) {
            out_->backup(end_ - next_);
        }
        out_ = &os;
        next_ = end_;
    }

which will return unused memory back to the "old" OutputStream before switching to the new one.
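
To make the effect easier to see in isolation, here is a small self-contained model of that interaction. ChunkedStream and Writer are hypothetical stand-ins that only mimic the shape of an OutputStream and of the StreamWriter's reset logic quoted above. They are not Avro's classes, and the chunk size is arbitrary:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an output stream that hands out whole chunks.
class ChunkedStream {
public:
    explicit ChunkedStream(std::size_t chunkSize) : chunk_(chunkSize), handedOut_(0) {
        assert(chunkSize > 0);
    }

    // Hand out one full chunk; the caller may not end up filling it.
    void next(std::uint8_t **data, std::size_t *len) {
        *data = chunk_.data();
        *len = chunk_.size();
        handedOut_ += chunk_.size();
    }

    // Take back the unused tail of the most recent chunk.
    void backup(std::size_t len) {
        assert(len <= handedOut_);
        handedOut_ -= len;
    }

    std::size_t byteCount() const { return handedOut_; }

private:
    std::vector<std::uint8_t> chunk_;  // scratch space, reused for every chunk
    std::size_t handedOut_;
};

// Hypothetical stand-in for the encoder's internal writer.
class Writer {
public:
    Writer() : out_(nullptr), next_(nullptr), end_(nullptr) {}

    // Same logic as the reset() quoted above: return the unused bytes
    // to the old stream before switching to the new one.
    void reset(ChunkedStream &os) {
        if (out_ != nullptr && end_ != next_) {
            out_->backup(static_cast<std::size_t>(end_ - next_));
        }
        out_ = &os;
        next_ = end_;
    }

    // Write one byte, fetching a new chunk when the current one is full.
    void write(std::uint8_t byte) {
        if (next_ == end_) {
            std::size_t len = 0;
            out_->next(&next_, &len);
            end_ = next_ + len;
        }
        *next_++ = byte;
    }

private:
    ChunkedStream *out_;
    std::uint8_t *next_;
    std::uint8_t *end_;
};
```

With a 4096-byte chunk and 10 bytes written, byteCount() on the model stream reports 4096 until reset is called a second time on the same stream, and 10 afterwards.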

So, you can abuse the encoder's init method like this:

// setup as always
MyAvroStruct obj;
avro::EncoderPtr encoder = avro::binaryEncoder();
std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream();

// actual serialization
encoder->init(*out);
avro::encode(*encoder, obj);

// re-init on the same OutputStream. Happens to shrink the stream to fit
encoder->init(*out);
size_t bufferSize = out->byteCount();

However, this behavior is not documented, so it might break in the future.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
