How to read a .csv with multiple lines in a record in Spark Scala?

I am trying to read a .csv file in spark-scala where a single record can span multiple lines:

id ,name
3221,uhbjh
12233,"My name is 
ydbc"
2333,jdhv

Initially I was using the code below, which returns extra records because of the multi-line record:

val df = spark.read
      .option("header", "true")
      .option("sep", ",")
      .option("inferSchema", false)
      .csv(s"${file_path}")

returns:

+-----+-----------+
|  id |       name|
+-----+-----------+
| 3221|      uhbjh|
|12233|My name is |
|ydbc"|       null|
| 2333|       jdhv|
+-----+-----------+

Schema as:

scala> df.printSchema()
root
 |-- id : string (nullable = true)
 |-- name: string (nullable = true)

Once I figured out that there are multi-line records, I used:

val df = spark.read
      .option("header", true)
      .option("sep", ",")
      .option("inferSchema", false)
      .option("multiLine", true)
      .csv(s"${input_file}")

returns:

+-----+----------------+
|  id |            name|
+-----+----------------+
| 3221|           uhbjh|
|12233|My name is 
ydbc|
| 2333|            jdhv|
+-----+----------------+

Schema as:

scala> df.printSchema()
root
 |-- id : string (nullable = true)
 |-- name: string (nullable = true)

What options should I use to overcome this issue?



Solution 1:[1]

You did well on your second try; you just need to clean the data afterwards (remove the line-break characters).

Try this:

import org.apache.spark.sql.functions.{col, regexp_replace}

val df = spark.read
      .option("header", true)
      .option("sep", ",")
      .option("inferSchema", false)
      .option("multiLine", true)
      .csv(s"${input_file}")

val clean_df = df.withColumn("name", regexp_replace(col("name"), "[\n\r]", " "))
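To see what that `regexp_replace` does to each cell, here is a minimal plain-Scala sketch of the same substitution (no Spark session needed); the `clean` helper and the sample value are illustrative, not part of the original code:

```scala
object CleanNewlines {
  // Same pattern as regexp_replace(col("name"), "[\n\r]", " "):
  // every carriage return or line feed becomes a single space.
  def clean(s: String): String = s.replaceAll("[\n\r]", " ")

  def main(args: Array[String]): Unit = {
    // The multi-line value from the sample CSV above.
    val raw = "My name is \nydbc"
    println(clean(raw)) // the embedded newline is now a space
  }
}
```

Note that a Windows-style `\r\n` break becomes two spaces with this pattern; if that matters, `regexp_replace(col("name"), "[\n\r]+", " ")` collapses each run of break characters into one space instead.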

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Gohmz