How to read a .csv with multi-line records in Spark (Scala)?
I am trying to read a .csv file whose records span multiple lines in Spark (Scala). Sample data:
id ,name
3221,uhbjh
12233,"My name is
ydbc"
2333,jdhv
I initially used the code below, which returns extra rows because of the multi-line record:
val df = spark.read
.option("header", "true")
.option("sep", ",")
.option("inferSchema", false)
.csv(s"${file_path}")
returns:
+-----+-----------+
|   id|       name|
+-----+-----------+
| 3221|      uhbjh|
|12233|My name is |
|ydbc"|       null|
| 2333|       jdhv|
+-----+-----------+
Schema as:
scala> df.printSchema()
root
|-- id : string (nullable = true)
|-- name: string (nullable = true)
When I figured out that there were multi-line records, I used:
val df = spark.read
.option("header", true)
.option("sep", ",")
.option("inferSchema", false)
.option("multiLine", true)
.csv(s"${input_file}")
returns:
+-----+----------+
|   id|      name|
+-----+----------+
| 3221|     uhbjh|
|12233|My name is
ydbc|
| 2333|      jdhv|
+-----+----------+
Schema as:
scala> df.printSchema()
root
|-- id : string (nullable = true)
|-- name: string (nullable = true)
What options should I use to overcome this issue?
Solution 1:[1]
You did well on your second try; the records are now parsed correctly, and you just need to clean the data by removing the line-break characters.
Try this:
import org.apache.spark.sql.functions.{col, regexp_replace}

val df = spark.read
  .option("header", true)
  .option("sep", ",")
  .option("inferSchema", false)
  .option("multiLine", true)
  .csv(s"${input_file}")

// Replace every line feed or carriage return in "name" with a space
val clean_df = df.withColumn("name", regexp_replace(col("name"), "[\n\r]", " "))
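The `[\n\r]` pattern passed to `regexp_replace` is an ordinary Java regex, so its effect can be sanity-checked outside Spark with plain `String.replaceAll`. A minimal sketch (no SparkSession needed; the sample value mirrors the multi-line record from the question):

```scala
// The value Spark parses for the multi-line record: the "name"
// field contains an embedded newline.
val raw = "My name is\nydbc"

// Same character class the regexp_replace call uses: replace
// every LF or CR with a single space.
val cleaned = raw.replaceAll("[\n\r]", " ")

println(cleaned) // My name is ydbc
```

After the same substitution inside Spark, `clean_df.show()` renders the row on a single line instead of breaking the table.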
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Gohmz |
