Benefit of saving data in Hadoop Sequence File vs Parquet format
Part of an unfinished Spark pipeline I am working on saves data as a Hadoop sequence file. However, as I tried to finish the pipeline I ran into a bug (https://issues.apache.org/jira/browse/HDFS-14762) that prevents me from reading the saved sequence file. My intuition is to save the data as Parquet instead. Nevertheless, since the previous engineer (no longer in contact) who built the pipeline explicitly saved the data as Hadoop sequence files, I am wary that there might be benefits to sequence files over Parquet.
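For reference, the failing read looks roughly like this. A minimal self-contained sketch with the same (BytesWritable, BytesWritable) layout as the pipeline; the sample data and path are illustrative, not from the real job:

```scala
import org.apache.hadoop.io.BytesWritable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("seqfile-demo").getOrCreate()
val sc = spark.sparkContext

// Illustrative output path; saveAsSequenceFile requires a nonexistent directory.
val path = java.nio.file.Files.createTempDirectory("seqfile").resolve("out").toString

// Write a tiny archive with the same key/value layout as the pipeline.
sc.parallelize(Seq("k1" -> "v1", "k2" -> "v2"))
  .map { case (k, v) => (new BytesWritable(k.getBytes), new BytesWritable(v.getBytes)) }
  .saveAsSequenceFile(path)

// Read it back. Hadoop reuses Writable instances across records, so copy the
// bytes out before collecting.
val restored = sc
  .sequenceFile[BytesWritable, BytesWritable](path)
  .map { case (k, v) => (new String(k.copyBytes()), new String(v.copyBytes())) }
  .collect()
  .toMap
```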
EDIT
What the code does is archive messages from a Kafka topic. Something like this:
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.compress.DefaultCodec
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

final case class Message(partition: Int, offset: Long, msg: Array[Byte])
final case class SortKey(id: Int)

val records: RDD[(SortKey, Message)] =
  KafkaUtils
    .createRDD[String, Array[Byte]](spark.sparkContext, kafkaConfig, offsetRanges, PreferConsistent)
    .map { record =>
      (SortKey(record.id), Message(record.partition(), record.offset(), record.value()))
    }

records.map { case (_, rec) =>
  val key = ??? // Get unique key from topic and partition
  (new BytesWritable(key.getBytes), new BytesWritable(rec.msg))
}.saveAsSequenceFile(
  "/tmp/archive-2022-01-01T00:00:00",
  Some(classOf[DefaultCodec])
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
