How to export a pandas DataFrame to parquet to avoid the error: parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary

I am quite new to Spark and confused by the error I get. The suggestions I have found online mention incompatible schemas, but I am not sure whether that applies to my case.

I am trying to export a pandas DataFrame to parquet files so that I can then read them into a Spark DataFrame and do operations on them. When I try a join on the Spark DataFrame built by reading those parquet files, I get the error:

java.lang.UnsupportedOperationException: parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
    at parquet.column.Dictionary.decodeToLong(Dictionary.java:52)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
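
From what I can tell, this trace comes from Spark's vectorized parquet reader: it tries to decode a column as long but finds a dictionary of doubles, i.e. the same column seems to carry two different physical types across files. A workaround I have seen suggested for this exact trace (I am only assuming it applies here) is to fall back to the non-vectorized reader before reading, at the cost of slower scans:

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")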

So far I have exported my parquet files with two sets of parameters (neither works):

df.to_parquet(filepath, compression='gzip', index=False)

or

df.to_parquet(filepath, compression=None, engine='pyarrow', index=False)
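
One hypothesis, and it is only an assumption on my part: if some exported chunks contain NaN, pandas silently promotes integer columns such as Sales to float64, so some parquet files end up storing them as double while others store long. A sketch of pinning the dtypes before every export, with the column names taken from the schema shown below (export_consistent is just an illustrative name), would be:

import pandas as pd

def export_consistent(df: pd.DataFrame, filepath: str) -> None:
    # Pin every column to a single dtype so all part files agree;
    # astype('int64') raises if NaNs slipped into an integer column,
    # which also points to the offending chunk.
    df = df.astype({
        'ItemID': 'int64',
        'StoreID': 'int64',
        'HistoryID': 'int64',
        'Sales': 'int64',
        'Price': 'float64',
        'Date': str,
        'dept': str,
        'category': str,
    })
    df.to_parquet(filepath, engine='pyarrow', index=False)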

I read all the parquet files I have written into Spark with spark.read.parquet('file/path/to/rootfolder'). After reading, the schema of the resulting DataFrame is as follows:

root
 |-- Date: string (nullable = true)
 |-- ItemID: long (nullable = true)
 |-- StoreID: long (nullable = true)
 |-- HistoryID: long (nullable = true)
 |-- Sales: long (nullable = true)
 |-- Price: double (nullable = true)
 |-- dept: string (nullable = true)
 |-- category: string (nullable = true)
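
Since spark.read.parquet infers the schema from one file's footer by default, a single mismatched file may only surface later, at decode time. A sketch of forcing the schema at read time, with the types copied from the printout above, might look like this (it makes the expectation explicit, though if a file genuinely stores Sales or Price with a different physical type the decode can still fail):

from pyspark.sql.types import (StructType, StructField,
                               StringType, LongType, DoubleType)

schema = StructType([
    StructField('Date', StringType(), True),
    StructField('ItemID', LongType(), True),
    StructField('StoreID', LongType(), True),
    StructField('HistoryID', LongType(), True),
    StructField('Sales', LongType(), True),
    StructField('Price', DoubleType(), True),
    StructField('dept', StringType(), True),
    StructField('category', StringType(), True),
])

df = spark.read.schema(schema).parquet('file/path/to/rootfolder')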

After that I just do a groupBy on that DataFrame and a join, and I get the error above.

from pyspark.sql import functions as F

catmin = df.groupBy('category').agg(
    F.min('Date').alias('StartAv'))

joint = df.join(catmin, on='category', how='left').select(df['*'], catmin['StartAv'])

joint.show(10)

where df above is my dataframe read from parquet.

How could I change the export to parquet files (if that is indeed the problem) so that the join succeeds?

My version of Spark is v2.4.0.cloudera2.



Solution 1:[1]

Just use the logic below to write your DataFrame in parquet format:

df = spark.createDataFrame(PandasDf)

df.write.parquet('<your file path>')
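
A slightly fuller sketch of the same idea, assuming an existing SparkSession named spark, a pandas DataFrame named PandasDf, and (an assumption) that Arrow-based conversion is available in this Spark 2.4 build; the output path is a placeholder:

# Optional: speeds up the pandas -> Spark conversion when pyarrow is installed.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame(PandasDf)
df.write.mode('overwrite').parquet('/path/to/output')

# Read it back once and verify the types before joining.
spark.read.parquet('/path/to/output').printSchema()

Writing through Spark guarantees every part file carries the same schema Spark will later read, which sidesteps the pandas/pyarrow type promotion entirely.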

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: DKNY