How to export a pandas DataFrame to Parquet and avoid the error: parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
I am quite new to Spark and confused by the error I get. The suggestions I have found online mention incompatible schemas, but I am not sure that applies to my case.
I am trying to export a pandas DataFrame to Parquet files so that I can then read them into a Spark DataFrame and run operations on them. When I try a join on the Spark DataFrame built from those Parquet files, I get this error:
java.lang.UnsupportedOperationException: parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
at parquet.column.Dictionary.decodeToLong(Dictionary.java:52)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
So far I have exported my Parquet files with the following two sets of parameters (neither works):
df.to_parquet(filepath, compression='gzip', index=False)
or
df.to_parquet(filepath, compression=None, engine='pyarrow', index=False)
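For reference, a common cause of this exact decodeToLong-on-a-double-dictionary error is dtype drift between the exported files: Spark infers a long column from one file, while another file stores the same column as double (pandas, for instance, promotes integer columns to float64 when they contain missing values). Below is a minimal sketch of pinning the dtypes before each export; the dtype mapping is an assumption taken from the Spark schema printed further down.
import pandas as pd

# Sketch: pin the dtypes before every export so all Parquet files agree
# (assumed mapping, taken from the Spark schema shown below).
# Pandas' nullable 'Int64' keeps integer columns integral even when they
# contain missing values, instead of silently promoting them to float64.
int_cols = ['ItemID', 'StoreID', 'HistoryID', 'Sales']
df[int_cols] = df[int_cols].astype('Int64')
df['Price'] = df['Price'].astype('float64')

df.to_parquet(filepath, engine='pyarrow', index=False)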
I read all the Parquet files I have written with spark.read.parquet(file/path/to/rootfolder). After the read, the schema of the resulting DataFrame is as follows:
root
|-- Date: string (nullable = true)
|-- ItemID: long (nullable = true)
|-- StoreID: long (nullable = true)
|-- HistoryID: long (nullable = true)
|-- Sales: long (nullable = true)
|-- Price: double (nullable = true)
|-- dept: string (nullable = true)
|-- category: string (nullable = true)
After that I just do a groupBy and a join on that DataFrame, and I get the error above.
import pyspark.sql.functions as F

catmin = df.groupBy('category').agg(
    F.min('Date').alias('StartAv'))
joint = aggsales.join(catmin, on='category', how='left').select(df['*'], catmin['StartAv'])
joint.show(10)
where df above is my DataFrame read from Parquet.
How should I change the Parquet export (if that is indeed the problem) so that the join succeeds?
My Spark version is v2.4.0.cloudera2.
Solution 1:[1]
Just use the logic below to write your DataFrame in Parquet format - convert the pandas DataFrame to a Spark DataFrame and let Spark write the files:
df = spark.createDataFrame(PandasDf)
df.write.parquet("<your file path>")
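Since createDataFrame derives its types from the pandas dtypes, a column already promoted to float64 would still come out as a double. Below is a sketch with an explicit schema, using the column names and types from the printSchema() output in the question and assuming the pandas columns already hold matching values; the output path is a placeholder.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, DoubleType)

spark = SparkSession.builder.getOrCreate()

# Explicit schema copied from the question's printSchema() output;
# passing it to createDataFrame pins the column types so they cannot
# drift between writes.
schema = StructType([
    StructField('Date', StringType(), True),
    StructField('ItemID', LongType(), True),
    StructField('StoreID', LongType(), True),
    StructField('HistoryID', LongType(), True),
    StructField('Sales', LongType(), True),
    StructField('Price', DoubleType(), True),
    StructField('dept', StringType(), True),
    StructField('category', StringType(), True),
])

df = spark.createDataFrame(PandasDf, schema=schema)
df.write.mode('overwrite').parquet('/path/to/output')  # placeholder path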
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | DKNY |
