java.lang.IllegalArgumentException: Illegal Capacity: -102 when reading a large Parquet file with PySpark

I have a large Parquet file (~5GB) that I want to load in Spark. The following command executes without error:

df = spark.read.parquet("path/to/file.parquet")

But when I try any operation like .show() or .repartition(n), I run into the following error:

java.lang.IllegalArgumentException: Illegal Capacity: -102

Any ideas on how I can fix this?



Solution 1:[1]

It's an integer overflow bug in the underlying Parquet reader; see PARQUET-1633: https://issues.apache.org/jira/browse/PARQUET-1633

Upgrade PySpark to 3.2.1 (or later); it bundles parquet-hadoop-1.12.2, which contains the actual fix.
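If your PySpark is installed through pip, one way to apply this is to upgrade the package and confirm the running version before retrying the read. A minimal sketch, assuming a pip-managed environment and reusing the path from the question:

# Upgrade the pip-managed PySpark (run in a shell, not inside the Spark session):
#   pip install --upgrade "pyspark>=3.2.1"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-check").getOrCreate()

# Confirm the runtime is at least 3.2.1, which ships with parquet-hadoop 1.12.2.
print(spark.version)

# Re-run the failing read; the action (.show()) is what actually scans the file.
df = spark.read.parquet("path/to/file.parquet")
df.show(5)

If you run Spark on a cluster, make sure the executors are upgraded as well, since the Parquet reader runs there rather than in the driver process alone.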

Sources

This content is licensed under CC BY-SA 3.0 per Stack Overflow's attribution requirements.

Source: Stack Overflow

Solution 1: jbaranski