How to use Hive 2.4.6 to read a protobuf-encoded sequence file

I have a sequence file whose values are proto3-encoded byte arrays.

I looked into elephant-bird, but it is very old and only supports protobuf 2.x: https://github.com/kevinweil/elephant-bird

It has also stopped publishing new packages, and the latest release is already a couple of years old, so I don't think it works anymore.

I assume I am not the only one who has run into this issue, so here is the scenario.

I wrote an application that generates a sequence file of (key, value) records. The key is irrelevant; the value is a proto3-encoded byte array. When my app writes the file, it does not know (or need to know) the schema of the proto. It only takes in the byte array and puts it into the sequence file.

When I create a table in Hive to query the data, I want to give Hive enough information about the proto schema that it can create the table correctly.
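To illustrate why Hive needs that extra information: a proto3-encoded byte array carries only field numbers and wire types, not column names or declared types. The following is a minimal pure-Python sketch of a top-level wire-format scan. The sample record (field 1 as a varint, field 2 as a string) is a hypothetical message invented for this illustration, not my actual schema, and the scan deliberately ignores nested messages and deprecated group wire types.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def scan_fields(buf):
    """Return (field_number, wire_type, raw_value) for each top-level field."""
    fields = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint (int32, int64, bool, enum, ...)
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit (fixed64, double, ...)
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (string, bytes, nested message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit (fixed32, float, ...)
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# Hypothetical record: field 1 = 150 (varint), field 2 = "alice" (string).
record = bytes([0x08, 0x96, 0x01, 0x12, 0x05]) + b"alice"
print(scan_fields(record))  # [(1, 0, 150), (2, 2, b'alice')]
```

Nothing in the bytes says that field 2 is a string rather than a nested message or raw bytes, or what either field is called. That mapping lives only in the generated class, which is why the SerDe has to be told which message class to use.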

Elephant-bird gives the following example: https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive#reading-protocol-buffers

create table users
 row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
 "serialization.class"="com.example.proto.gen.Storage$User")
stored as
 inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat";

But since elephant-bird is very old, is there an equivalent solution for Hive 2.4.6 and proto3 that someone can point me to?

Thank you.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
