How can I avoid OOM errors in an AWS Glue PySpark job?
I am getting the following error while running an AWS Glue job with 40 workers processing 40 GB of data:
Caused by: org.apache.spark.memory.SparkOutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5fa14240 : No space left on device
How can I optimize my job to avoid this error in PySpark?
[Screenshot of the Glue job metrics (glue_metrics) was attached here.]
Solution 1:[1]
Use the AWS Glue Spark shuffle manager with Amazon S3. It writes shuffle and spill files to an S3 bucket instead of the workers' local disks, which avoids the "No space left on device" failure when shuffle data outgrows local storage. This feature requires Glue 2.0 or later; see the AWS Glue documentation on the S3 shuffle manager for details.
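As a minimal sketch, the shuffle manager is typically enabled through Glue job parameters (set in the console, or in `DefaultArguments` when creating the job via boto3). The parameter names below follow the AWS Glue 2.0 shuffle-manager documentation; the bucket name and prefix are placeholders you would replace with your own:

```python
def s3_shuffle_arguments(bucket: str, prefix: str) -> dict:
    """Build Glue job arguments that redirect shuffle data to S3."""
    return {
        # Write shuffle data files to S3 instead of the workers' local disks.
        "--write-shuffle-files-to-s3": "true",
        # Also write shuffle spill files to S3.
        "--write-shuffle-spills-to-s3": "true",
        # S3 location where Spark stores the shuffle data.
        "--conf": f"spark.shuffle.glue.s3ShuffleBucket=s3://{bucket}/{prefix}",
    }

# Placeholder bucket/prefix; pass this dict as DefaultArguments to the job.
args = s3_shuffle_arguments("my-glue-shuffle-bucket", "shuffle-data")
```

Note that the S3 prefix should be cleaned up periodically (e.g. with a lifecycle rule), since shuffle files accumulate across job runs.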
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | semaphore |
