spark.driver.maxResultSize error in PySpark despite a very small dataset
I'm getting the following error in my PySpark job using AWS Glue (3.0):
```
An error occurred while calling o166.save. Job aborted due to stage failure:
Total size of serialized results of 73 tasks (1028.4 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
```
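For scale, the numbers in the error message imply each task is shipping roughly 14 MiB of serialized results back to the driver, which seems far too much for 70k small rows. A quick sanity check on the figures quoted above:

```python
# Back-of-the-envelope check on the error message numbers:
# 73 tasks returned 1028.4 MiB of serialized results in total.
total_mib = 1028.4
tasks = 73
per_task_mib = total_mib / tasks
print(f"{per_task_mib:.1f} MiB per task")  # ~14.1 MiB per task
```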
The perplexing part is that my DataFrame has only 70k rows and five small columns (10 partitions). The same query runs in Athena in under 10 seconds. How could such a small job exceed the driver's maxResultSize? Here's my code (10 G.1X workers):
```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.window import Window

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
spark = (
    SparkSession.builder
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

df = spark.sql("""select * from my_table""")
print(df.count())
print('writing')
df.write.format("parquet").partitionBy("my_col").mode("overwrite").save("s3://my_bucket/my_db/my_tbl/")
job.commit()
```
If I download the results to CSV, the file is smaller than maxResultSize. I've raised spark.driver.maxResultSize to several values (e.g. "3g"). When I set it to unlimited (spark.driver.maxResultSize = 0), I get an out-of-memory error instead:
```
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 8"...
```
What could I be missing? Besides the number of partitions or the data size, are there any other PySpark pitfalls that could trigger this error?
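For reference, this is how I've been overriding the limit. This is a sketch assuming Glue 3.0's special `--conf` job parameter (the setting must be in place before the Spark session is created, so it's passed at job submission rather than set in the script):

```
Key:   --conf
Value: spark.driver.maxResultSize=3g
```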
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0 (attribution per Stack Overflow's requirements).