java.lang.IllegalArgumentException: Illegal Capacity: -102 when reading a large parquet file with PySpark
I have a large Parquet file (~5 GB) that I want to load in Spark. The following command executes without any error:
df = spark.read.parquet("path/to/file.parquet")
But when I try any operation like .show() or .repartition(n), I run into the following error:
java.lang.IllegalArgumentException: Illegal Capacity: -102
Any ideas on how I can fix this?
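For context, spark.read.parquet is lazy, which is why the read call itself succeeds and the failure only appears once an action forces the file to be scanned. A minimal sketch of the pattern (the session name and path are placeholders, not from the original post):

from pyspark.sql import SparkSession

# Hypothetical session; adjust master/app settings for your environment.
spark = SparkSession.builder.appName("parquet-repro").getOrCreate()

# Reading is lazy: only the schema/footer is inspected here,
# so this line succeeds even if the data cannot be decoded later.
df = spark.read.parquet("path/to/file.parquet")

# Any action that forces the row groups to be read (show, count,
# repartition followed by an action, ...) is where the
# "Illegal Capacity" error surfaces.
df.show(5)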
Solution 1:[1]
It's an integer overflow bug in the underlying Parquet reader: https://issues.apache.org/jira/browse/PARQUET-1633
Upgrading PySpark to 3.2.1 resolves it; the actual fix is in the parquet-hadoop-1.12.2 jar.
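One way to confirm the upgrade took effect is to check the Spark version and the parquet-hadoop jar shipped with the installation. A minimal sketch, assuming a plain pip-installed PySpark (cluster deployments keep their jars elsewhere):

# Upgrade first, e.g.:
#   pip install --upgrade "pyspark==3.2.1"

import glob
import os

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()
print("Spark version:", spark.version)  # expect 3.2.1 or later

# A pip-installed PySpark bundles its jars under the package's jars/ directory.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(glob.glob(os.path.join(jars_dir, "parquet-hadoop-*.jar")))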
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | jbaranski