Read CSV with linebreaks in pyspark

I want to use pyspark to read a "legal" CSV (it follows RFC 4180) that has line breaks (CRLF) inside some of its rows. The screenshot below shows how the file looks when opened in Notepad++:

[Screenshot of the CSV file in Notepad++; image not available]
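
Because the file contents are not reproduced in the text, here is a minimal sketch of what such a file might look like, using only the Python standard library. The column names and values are hypothetical. It shows why a reader that splits on line endings sees more rows than an RFC 4180-aware parser does:

```python
import csv
import io

# Hypothetical file contents: the second record has a CRLF inside a quoted field.
raw = 'id,comment\r\n1,"first line\r\nsecond line"\r\n2,no line break\r\n'

# A naive split on line endings sees four physical lines...
physical_lines = raw.splitlines()

# ...while an RFC 4180-aware parser sees three records (header plus two rows).
# newline="" keeps the embedded CRLF intact for the csv module.
records = list(csv.reader(io.StringIO(raw, newline="")))

print(len(physical_lines))
print(len(records))
print(records[1])
```

The mismatch between physical lines and logical records is the same one that makes a line-oriented CSV reader return two rows instead of one.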

I tried to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.

Is there a command or an alternative way of reading the CSV that lets me read these two lines as a single row?



Solution 1:[1]

The wholeFile option does not exist (anymore?) in the Spark API documentation: https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html

This solution will work:

spark.read.option("multiLine", "true").csv("file.csv")

From the api documentation:

multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
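
For a slightly fuller picture, the sketch below writes a small sample file and reads it back with the multiLine option. The file contents, the helper names (write_sample_csv, read_multiline_csv, demo) and the local[1] master are assumptions made for illustration. The option is documented for Spark 2.2.0, so it may not be available on the 2.1 release mentioned in the question.

```python
import csv
import os
import tempfile


def write_sample_csv(path):
    """Write a hypothetical CSV whose first data row has a CRLF inside a quoted field."""
    rows = [
        ["id", "comment"],
        ["1", "first line\r\nsecond line"],
        ["2", "no line break"],
    ]
    # newline="" stops Python from translating the CRLF endings written by csv.writer.
    with open(path, "w", newline="") as f:
        csv.writer(f, lineterminator="\r\n").writerows(rows)


def read_multiline_csv(spark, path):
    """Read the file with the multiLine option so quoted line breaks stay in one record."""
    return (
        spark.read
        .option("header", "true")
        .option("multiLine", "true")
        .csv(path)
    )


def demo():
    """Run the round trip; requires a working PySpark installation."""
    # Imported here so the helpers above can be loaded without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("multiline-csv").getOrCreate()
    path = os.path.join(tempfile.mkdtemp(), "sample.csv")
    write_sample_csv(path)
    df = read_multiline_csv(spark, path)
    df.show(truncate=False)
    spark.stop()
```

Calling demo() in an environment with PySpark installed should show the record with id 1 as a single row rather than two.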

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jurrit