PySpark not writing CSV file correctly

Here's how PySpark wrote my file (screenshot below): the values come out in binary format instead of readable strings.

Do you have any idea why?

Here's the code:

output_file_path = '/tmp/users/csv'
df = spark.read.parquet("/user/hive/warehouse/tmp.db/users/*.parq")
df.coalesce(1).write.format('com.databricks.spark.csv').mode('overwrite').option("header", "true").save(output_file_path)

[screenshot of the malformed CSV output]



Solution 1:[1]

The problem was that the CREATE TABLE statement used to generate the parquet files (the files I later read into a dataframe) stored the data in binary format. The solution is to cast all your columns to the right type when you read them into a dataframe. Thank you all for your help :)

Solution 2:[2]

I think the problem is that you have not specified the delimiter:

output_file_path = '/tmp/users/csv'
df = spark.read.parquet("/user/hive/warehouse/tmp.db/users/*.parq")
df.coalesce(1).write.format('com.databricks.spark.csv').mode('overwrite').option("header", "true").option("delimiter", ",").save(output_file_path)

Solution 3:[3]

It's not the "values" that are stored in binary format, it's the strings, and there is a Spark parameter to handle it:

spark.sql.parquet.binaryAsString

Property Name: spark.sql.parquet.binaryAsString
Default: false
Since Version: 1.1.1
Meaning: Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Fares DAOUD
Solution 2: seghair tarek
Solution 3: David דודו Markovitz