Spark Java - how to convert a non-delimited file into a Dataset in Spark Java

I need to read a non-delimited (fixed-width) file and convert it into a Dataset in Spark Java. I need to map the values to column names read from a CSV, splitting each line based on the size of each attribute. Please suggest how to do this in Spark Java.



Solution 1:[1]

I cannot show this in Java as I use Scala, but the idea is to apply successive substring (or slice) operations, either with a foldLeft over the column specifications or without one.

An example in Scala which you can convert to Java - this is the less advanced option (no foldLeft):

import org.apache.spark.sql.functions._
import spark.implicits._

// Cols for renaming.
val list = List("C1", "C2", "C3")

// Gen some data. (Note: no trailing comma before the closing parenthesis.)
val df = Seq(
       ("C1111sometext999"),
       ("C2222sometext888")
       ).toDF("data")

// "heavy" lifting.  
val df2 = df.selectExpr("substring(data, 0, 5)", "substring(data, 6,8)", "substring(data, 14,3)")

// Rename from list. Can also do "as Cn" in selectExpr. 
val df3 = df2.toDF(list:_*) 
df3.show

returns:

+-----+--------+---+
|   C1|      C2| C3|
+-----+--------+---+
|C1111|sometext|999|
|C2222|sometext|888|
+-----+--------+---+

You will then have to cast the columns to their proper types, e.g. with cast in selectExpr or withColumn.
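Since the question asks for Java: a minimal sketch of the same per-line slicing in plain Java, which you could apply inside a Dataset map over the raw lines. The class name, field names, and widths are assumptions taken from the Scala example above; Java's String.substring uses 0-based start/end offsets rather than Spark SQL's 1-based position/length.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FixedWidthParser {
    // Column names and field widths, in file order (illustrative values
    // matching the Scala example: 5 + 8 + 3 characters per line).
    static final String[] NAMES = {"C1", "C2", "C3"};
    static final int[] WIDTHS = {5, 8, 3};

    // Split one fixed-width line into named fields by cumulative offsets.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        int pos = 0;
        for (int i = 0; i < NAMES.length; i++) {
            fields.put(NAMES[i], line.substring(pos, pos + WIDTHS[i]));
            pos += WIDTHS[i];
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("C1111sometext999"));
        // prints {C1=C1111, C2=sometext, C3=999}
    }
}
```

In Spark Java you would typically load the NAMES/WIDTHS pairs from your CSV metadata file first, then use this kind of helper in a map, or build the equivalent selectExpr("substring(...)") strings from the widths, as in the Scala version.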

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
