Read a JavaRDD column-wise in Apache Spark

I'm trying to understand how data in a JavaRDD can be read column-wise using Java.

JavaRDD is, I guess, the very first data structure in Spark. In newer versions there are different ways of converting a JavaRDD to a Dataset<Row>, which is a much more optimized data structure.

For example:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// spark is an existing SparkSession.
// Column names and types are declared up front in a schema.
StructType schema = new StructType(new StructField[]{
        new StructField("_1", DataTypes.StringType, false, Metadata.empty()),
        new StructField("_2", DataTypes.StringType, false, Metadata.empty()),
        new StructField("_3", DataTypes.StringType, false, Metadata.empty())
});

// Build a JavaRDD of comma-separated strings: "0,b,c", "1,b,c", ...
JavaRDD<String> rdd1 = spark
        .range(5)
        .javaRDD()
        .map(s -> s + ",b,c");

// Split each line into fields and wrap the fields in a Row.
JavaRDD<Row> rdd2 = rdd1
        .map(s -> s.split(","))
        .map(s -> RowFactory.create((Object[]) s));

// Attach the schema to get a Dataset<Row>.
Dataset<Row> df = spark.createDataFrame(rdd2, schema);

df.show();

That works fine.
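Incidentally, one of the other "different ways" I mentioned is the reflection-based route, where Spark infers the column names and types from a Java bean instead of an explicit schema. A minimal sketch (the Person bean is hypothetical; spark is an existing SparkSession):

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical bean; Spark derives the column names from the getters.
public static class Person implements Serializable {
    private String name;
    private int age;
    public Person() {}
    public Person(String name, int age) { this.name = name; this.age = age; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

JavaRDD<Person> peopleRdd = jsc.parallelize(Arrays.asList(
        new Person("alice", 30),
        new Person("bob", 25)));

// The schema (columns "name" and "age") is inferred via reflection.
Dataset<Row> people = spark.createDataFrame(peopleRdd, Person.class);
people.show();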

But can anyone share sample code showing how CSV files were handled back when a JavaRDD was the only option? I'm curious to see and understand how column names were managed, how columns were selected out of a JavaRDD, and how columns were defined in the first place.

Can anyone share something on this ?
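For reference, my rough understanding of the classic pre-DataFrame pattern is sketched below (this is a guess on my part, assuming a hypothetical people.csv with a header row name,age): the header line is parsed into a name -> index map on the driver, and "selecting a column" just means projecting a field by index.

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("csv-rdd").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

// Each element is one raw line of the (hypothetical) file.
JavaRDD<String> lines = sc.textFile("people.csv");

// Column names managed by hand: parse the header once on the driver
// and build a name -> index map.
String header = lines.first();
String[] names = header.split(",");
Map<String, Integer> colIndex = new HashMap<>();
for (int i = 0; i < names.length; i++) {
    colIndex.put(names[i], i);
}

// Drop the header row, then split every remaining line into fields.
JavaRDD<String[]> rows = lines
        .filter(line -> !line.equals(header))
        .map(line -> line.split(","));

// "Selecting a column" = projecting a field by its index.
int ageIdx = colIndex.get("age");
JavaRDD<String> ages = rows.map(fields -> fields[ageIdx]);

ages.collect().forEach(System.out::println);

Is that roughly it, or were there better idioms for keeping column names and indices in sync?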



Sources

Source: Stack Overflow, licensed under CC BY-SA 3.0.