PySpark not writing CSV file correctly
Solution 1:[1]
The problem was that the CREATE TABLE statement used to generate the Parquet files (the files I later read into a DataFrame) stored the data in binary format. The solution is to cast all of your columns to the right type when you read them into a DataFrame. Thank you all for your help :)
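A minimal sketch of that casting, assuming the paths from Solution 2 and hypothetical column names `user_id` and `user_name`; replace the casts with your actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Read the Parquet files whose columns came back as binary.
df = spark.read.parquet("/user/hive/warehouse/tmp.db/users/*.parq")

# Cast each column to its intended type; user_id and user_name
# are hypothetical names used here only for illustration.
df = (df
      .withColumn("user_id", col("user_id").cast("bigint"))
      .withColumn("user_name", col("user_name").cast("string")))

# With the columns cast, the CSV output is written as readable text.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("/tmp/users/csv")
```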
Solution 2:[2]
I think the problem is that you have not specified the delimiter:
```python
output_file_path = '/tmp/users/csv'

df = spark.read.parquet("/user/hive/warehouse/tmp.db/users/*.parq")
df.coalesce(1).write.format('com.databricks.spark.csv') \
    .mode('overwrite') \
    .option("header", "true") \
    .option("delimiter", ",") \
    .save(output_file_path)
```
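Note that `com.databricks.spark.csv` is the format name of the old external spark-csv package; since Spark 2.0, CSV support is built in, so the same write can be expressed with the built-in writer:

```python
# Equivalent write using the CSV writer built into Spark 2.0+.
df.coalesce(1).write.mode('overwrite') \
    .option("header", "true") \
    .option("delimiter", ",") \
    .csv(output_file_path)
```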
Solution 3:[3]
It's not "values" that are being stored in binary format, it's strings, and there is a Spark parameter to handle it:
spark.sql.parquet.binaryAsString
| Property Name | Default | Meaning | Since Version |
|---|---|---|---|
| spark.sql.parquet.binaryAsString | false | Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems. | 1.1.1 |
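A minimal sketch of setting this flag before reading the Parquet files (the path is reused from Solution 2 and may differ in your setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tell Spark SQL to interpret Parquet binary columns as strings.
spark.conf.set("spark.sql.parquet.binaryAsString", "true")

df = spark.read.parquet("/user/hive/warehouse/tmp.db/users/*.parq")
df.printSchema()  # the former binary columns should now appear as string
```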
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Fares DAOUD |
| Solution 2 | seghair tarek |
| Solution 3 | David דודו Markovitz |