Benefit of saving data in Hadoop Sequence File vs Parquet format
Part of an unfinished Spark pipeline I am working on saves data as a Hadoop sequence file. However, as I tried to finish the pipeline I ran into a bug (https://issues.apache.org/jira/browse/HDFS-14762) that prevents me from reading the saved sequence file. My intuition is to save the data as Parquet instead. Nevertheless, since the previous engineer (no longer in contact) who built the pipeline explicitly saved the data as Hadoop sequence files, I am wary that there might be benefits to sequence files over Parquet.
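For reference, the failing read looks roughly like this. A minimal self-contained sketch with the same (BytesWritable, BytesWritable) layout as the pipeline; the sample data and path are illustrative, not from the real job:

```scala
import org.apache.hadoop.io.BytesWritable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("seqfile-demo").getOrCreate()
val sc = spark.sparkContext

// Illustrative output path; saveAsSequenceFile requires a nonexistent directory.
val path = java.nio.file.Files.createTempDirectory("seqfile").resolve("out").toString

// Write a tiny archive with the same key/value layout as the pipeline.
sc.parallelize(Seq("k1" -> "v1", "k2" -> "v2"))
  .map { case (k, v) => (new BytesWritable(k.getBytes), new BytesWritable(v.getBytes)) }
  .saveAsSequenceFile(path)

// Read it back. Hadoop reuses Writable instances across records, so copy the
// bytes out before collecting.
val restored = sc
  .sequenceFile[BytesWritable, BytesWritable](path)
  .map { case (k, v) => (new String(k.copyBytes()), new String(v.copyBytes())) }
  .collect()
  .toMap
```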
EDIT
What the code does is archive messages from a Kafka topic. Something like this:
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.compress.DefaultCodec
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

final case class Message(partition: Int, offset: Long, msg: Array[Byte])
final case class SortKey(id: Int)

val records: RDD[(SortKey, Message)] =
  KafkaUtils
    .createRDD[String, Array[Byte]](spark.sparkContext, kafkaConfig, offsetRanges, PreferConsistent)
    .map { record =>
      (SortKey(record.id), Message(record.partition(), record.offset(), record.value()))
    }

records.map { case (_, rec) =>
  val key = ??? // Get unique key from topic and partition
  (new BytesWritable(key.getBytes), new BytesWritable(rec.msg))
}.saveAsSequenceFile(
  "/tmp/archive-2022-01-01T00:00:00",
  Some(classOf[DefaultCodec])
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
