AWS Glue (Spark) very slow

I've inherited some code that runs incredibly slowly on AWS Glue.

Within the job, it creates a number of dynamic frames that are then joined using spark.sql. Tables are read from MySQL and Postgres databases, Glue joins them together, and finally the result is written back to a table in Postgres.

Example (note: databases etc. have been renamed and simplified, as I can't paste my actual code directly):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
jobName = args['JOB_NAME']
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(jobName, args)
# MySQL
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "trans").toDF().createOrReplaceTempView("trans")
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "types").toDF().createOrReplaceTempView("types")
glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "currency").toDF().createOrReplaceTempView("currency")

# DB2 (Postgres)
glueContext.create_dynamic_frame.from_catalog(database = "db2", table_name = "watermark").toDF().createOrReplaceTempView("watermark")

# transactions
new_transactions_df = spark.sql("[SQL CODE HERE]")

# Write to DB
conf_g = glueContext.extract_jdbc_conf("My DB")
url = conf_g["url"] + "/reporting"

new_transactions_df.write.option("truncate", "true").jdbc(url, "staging.transactions", properties=conf_g, mode="overwrite")

The [SQL CODE HERE] is literally a simple SELECT statement joining the three tables together to produce an output, which is then written to the staging.transactions table.
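For illustration only, a three-way join of the shape described above might look like the following. The column names (type_id, currency_id, etc.) are hypothetical, made up for this sketch; the actual query is not shown in the question:

```python
# Hypothetical example of a simple three-table join like the one described.
# All column names here are invented for illustration.
sql = """
    SELECT t.id,
           t.amount,
           ty.name AS type_name,
           c.code  AS currency_code
    FROM trans t
    JOIN types ty   ON t.type_id = ty.id
    JOIN currency c ON t.currency_id = c.id
"""
# In the job this would be executed as:
# new_transactions_df = spark.sql(sql)
```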

When I last ran this it only wrote 150 rows but took 9 minutes to do so. Can somebody please point me in the direction of how to optimise this?

Additional info:

  • Maximum capacity: 6
  • Worker type: G.1X
  • Number of workers: 6


Solution 1:[1]

A Glue Spark cluster usually takes around 10 minutes just for startup, so that time (9 minutes) seems reasonable (unless you run Glue 2.0, but you didn't specify which Glue version you are using).

https://aws.amazon.com/es/about-aws/whats-new/2020/08/aws-glue-version-2-featuring-10x-faster-job-start-times-1-minute-minimum-billing-duration/#:~:text=With%20Glue%20version%202.0%2C%20job,than%20a%2010%20minute%20minimum.
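One way to check which Glue version an existing job uses is the AWS CLI (the job name my-glue-job below is a placeholder for your actual job name):

```shell
# Print the Glue version of an existing job.
# "my-glue-job" is a placeholder; requires configured AWS credentials.
aws glue get-job --job-name my-glue-job \
    --query 'Job.GlueVersion' --output text
```

If this prints 0.9 or 1.0, upgrading the job to Glue 2.0 or later should cut the startup overhead substantially.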

Solution 2:[2]

Enable Metrics:

AWS Glue provides Amazon CloudWatch metrics that can be used to provide information about the executors and the amount of work done by each executor. You can enable CloudWatch metrics on your AWS Glue job by doing one of the following:

Using a special parameter: Add the following argument to your AWS Glue job. This parameter allows you to collect metrics for job profiling for your job run. These metrics are available on the AWS Glue console and the CloudWatch console.

   Key: --enable-metrics
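The same flag can also be supplied programmatically as part of the job's default arguments, for example when defining the job with boto3 or infrastructure-as-code. This sketch only shows the shape of the arguments dict; flag-style Glue parameters conventionally take an empty string as their value:

```python
# Sketch of Glue job DefaultArguments enabling CloudWatch job metrics.
# Flag-style special parameters like --enable-metrics take an empty value.
default_arguments = {
    "--enable-metrics": "",
}
```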

Using the AWS Glue console: To enable metrics on an existing job, do the following:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Jobs.
  3. Select the job that you want to enable metrics for.
  4. Choose Action, and then choose Edit job.
  5. Under Monitoring options, select Job metrics.
  6. Choose Save.

Courtesy: https://softans.com/aws-glue-etl-job-running-for-a-long-time/

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Nuno Carvalho
Solution 2: GHULAM NABI