spark.driver.maxResultSize error in PySpark despite a very small dataset
I'm getting the following error in my PySpark job using AWS Glue (3.0):
```
An error occurred while calling o166.save. Job aborted due to stage failure:
Total size of serialized results of 73 tasks (1028.4 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
```
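For scale, the numbers in the error message imply each task is shipping roughly 14 MiB of serialized results back to the driver, which seems far too much for 70k small rows. A quick sanity check on the figures quoted above:

```python
# Back-of-the-envelope check on the error message numbers:
# 73 tasks returned 1028.4 MiB of serialized results in total.
total_mib = 1028.4
tasks = 73
per_task_mib = total_mib / tasks
print(f"{per_task_mib:.1f} MiB per task")  # ~14.1 MiB per task
```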
The perplexing part is that my DataFrame has only 70k rows and five small columns (10 partitions). The same query runs in Athena in under 10 seconds. How could such a small job exceed the driver's maxResultSize? Here's my code (10 G.1X workers):
```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.window import Window

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
spark = (
    SparkSession.builder
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

df = spark.sql("""select * from my_table""")
print(df.count())
print('writing')
df.write.format("parquet").partitionBy("my_col").mode("overwrite").save("s3://my_bucket/my_db/my_tbl/")
job.commit()
```
If I download the results to CSV, the file is smaller than maxResultSize. I've raised spark.driver.maxResultSize to several values (e.g. "3g"). When I set it to unlimited (spark.driver.maxResultSize = 0), I get an out-of-memory error instead:
```
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 8"...
```
What could I be missing? Besides the number of partitions or the data size, are there any other PySpark pitfalls that could trigger this error?
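For reference, this is how I've been overriding the limit. This is a sketch assuming Glue 3.0's special `--conf` job parameter (the setting must be in place before the Spark session is created, so it's passed at job submission rather than set in the script):

```
Key:   --conf
Value: spark.driver.maxResultSize=3g
```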
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0 (attribution per Stack Overflow's requirements).