My AWS Glue job runs for a long time, or an AWS Glue straggler task takes a long time to complete.
Resolution
Common factors that can cause AWS Glue jobs to run for a long time include configuration settings and the structure of the data and scripts.
The following steps help to optimize performance.
Set up metrics and logging
To identify issues and optimize performance, use AWS Glue's integrated monitoring tools, such as Amazon CloudWatch and job observability metrics.
Also, set up alerts for anomalies, and turn on the Apache Spark UI for better insight into how the AWS Glue job operates. You can also use the AWS Glue job run insights feature to understand the job's runtime behavior in detail.
To turn on metrics, use one of the following methods.
From AWS Glue console
- Open the AWS Glue console.
- In the navigation pane, choose ETL Jobs.
- Select the job for which you want to turn on metrics.
- Choose Actions, and then choose Edit job.
- On the Job details tab, under Advanced options, select Job metrics, Job observability metrics, Continuous logging, and Spark UI.
- Choose Save.
From CLI or SDK
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
In your API call or AWS CLI command, pass the following key-value pairs in the DefaultArguments parameter:
'--enable-metrics' : 'true'
'--enable-observability-metrics' : 'true'
'--enable-continuous-cloudwatch-log' : 'true'
'--enable-spark-ui' : 'true'
'--spark-event-logs-path' : '<s3://path/to/log/>'
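As an illustration, the same flags can be assembled in code and passed as the DefaultArguments parameter of the CreateJob or UpdateJob API call (for example, through boto3). This is a minimal sketch; the actual API call isn't shown because it requires AWS credentials and an existing job, and the S3 path is the same placeholder as above.

```python
# Sketch: the job parameters above expressed as a Python dict. You can pass
# this dict as DefaultArguments to glue.create_job() or glue.update_job()
# through boto3 (call not shown; it needs AWS credentials and a real job).
default_arguments = {
    "--enable-metrics": "true",
    "--enable-observability-metrics": "true",
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-spark-ui": "true",
    # Placeholder path: replace with your own S3 location for Spark event logs.
    "--spark-event-logs-path": "s3://path/to/log/",
}
```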
Identify bottlenecks
To find bottlenecks, use the driver logs that the job run generates in CloudWatch, or use the Spark UI logs. For more information about the Spark UI, see Web UI on the Apache website.
Driver logs
To check driver logs:
- Open the job from the AWS Glue console.
- On the Runs tab, select the job run that you want to check the logs for.
- In the CloudWatch log group /aws-glue/jobs/error, select the log stream that has the same name as the job run ID. Log streams with names that end in the suffix "_g-xxxxxx" are executor logs.
- In the driver logs, check for tasks that run for a long time before they complete. In the following example, one task ran for about 34 minutes, and the Spark job that the task belonged to took about 36 minutes to finish.
2024-09-13 10:06:13,847 INFO [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logInfo(61)): Finished task 0.0 in stage 0.0 (TID 0) in 2054673 ms on 172.35.184.56 (executor 1) (1/1)
2024-09-13 10:06:13,894 INFO [Thread-13] scheduler.DAGScheduler (Logging.scala:logInfo(61)): Job 0 finished: save at DataSink.scala:666, took 2164.67 s
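When driver logs are large, a short script can surface the slowest tasks. The following is a hedged sketch: the regex targets the TaskSetManager "Finished task ... in N ms" format shown above and might need adjusting for other log variants.

```python
import re

# Matches Spark TaskSetManager completion lines such as:
#   "Finished task 0.0 in stage 0.0 (TID 0) in 2054673 ms on ..."
FINISHED_TASK = re.compile(r"Finished task \S+ in stage \S+ \(TID \d+\) in (\d+) ms")

def task_duration_ms(line):
    """Return the task duration in milliseconds, or None if the line doesn't match."""
    match = FINISHED_TASK.search(line)
    return int(match.group(1)) if match else None

example = (
    "2024-09-13 10:06:13,847 INFO [task-result-getter-0] scheduler.TaskSetManager "
    "(Logging.scala:logInfo(61)): Finished task 0.0 in stage 0.0 (TID 0) "
    "in 2054673 ms on 172.35.184.56 (executor 1) (1/1)"
)
duration = task_duration_ms(example)  # 2054673 ms, about 34 minutes
```

Sorting the extracted durations in descending order points you directly at the straggler tasks.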
Spark UI logs
In the Spark UI logs, on the Jobs and Stages tabs, look for stages and tasks that ran for a long time.
Fix bottlenecks
Partition data more efficiently. AWS Glue jobs rely on distributed processing. If data isn't partitioned efficiently, then Spark workers must process large, unbalanced partitions, and this causes delays. To control the number of partitions, use the repartition() or coalesce() functions in Spark. Make sure that your data is well-partitioned to take advantage of the distributed nature of Spark. You can also configure AWS Glue DynamicFrame to partition data with splitFields or custom partition keys.
Increase capacity by adding worker nodes. If there aren't enough worker nodes to handle the volume of data, then the job runs slowly because of limited parallelism. Increase the number of worker nodes, or switch to a larger worker type. Make sure that you use the right worker size and that you allocated enough DPUs (Data Processing Units) to process your data efficiently. The number of tasks handled per executor is equal to four times the number of DPUs. For example, a G.1X worker type has one DPU and handles four tasks per executor. Note that the G.4X and G.8X worker types are available only for AWS Glue version 3.0 or later Spark ETL jobs.
Redistribute data to reduce skew across partitions. Data skew occurs when partitions have significantly different amounts of data, so nodes with more data are overworked while others sit idle. Identify skewed keys by analyzing the data distribution. Then, redistribute or balance the data across partitions, or use the salting technique to spread out hot keys. AWS Glue 3.0 and later can handle skewed joins with Spark native features such as adaptive query execution (AQE). To handle skewed joins with AQE, set spark.sql.adaptive.skewJoin.enabled to true. AQE is turned on by default starting with Spark 3.2.0. To turn on AQE for AWS Glue 3.0, set the spark.sql.adaptive.enabled parameter to true. For more information, see Performance Tuning in the Spark SQL Guide on the Apache website.
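The salting idea can be illustrated with plain Python (the key name and salt count are made up for the example): each row of a hot key gets a random salt suffix, so the rows spread across several synthetic keys instead of all landing in one partition. In a Spark job you would append the salt to the join key on both sides before joining.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumed salt count; tune to the observed skew

# One key dominates the dataset: the classic skew scenario.
rows = [("hot_key", value) for value in range(1000)]

# Append a random salt so "hot_key" becomes up to 4 synthetic keys.
salted = [(f"{key}_{random.randrange(NUM_SALTS)}", value) for key, value in rows]

# The rows now spread across hot_key_0 .. hot_key_3 instead of one partition.
distribution = Counter(key for key, _ in salted)
```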
Replace UDFs with native Spark functions. Custom user-defined functions (UDFs) and complex transformations can be expensive to run and can slow down Spark jobs. Avoid UDFs when possible, and rely on native Spark functions that are optimized for performance. If UDFs are necessary, rewrite them in Scala rather than Python because Scala UDFs often perform better. Also, for better optimization, apply transformations using DataFrames or DynamicFrames.
Minimize shuffle operations. Shuffle operations, such as groupBy, join, or orderBy, transfer data across nodes. These can become bottlenecks if overused or if not managed properly. Minimize shuffle operations by filtering and aggregating data as early as possible in the transformation process. To avoid unnecessary data transfer, use broadcast joins where applicable. Also, make sure that the shuffled data is partitioned efficiently.
Remove unneeded caching. Overuse or improper use of caching can lead to increased memory consumption and can slow down the job. Use cache() or persist() only when you reuse a dataset multiple times in a workflow. Note the available memory and clear any cached datasets with unpersist() when those datasets are no longer needed.
Break down long dependency chains. If your job has a long chain of transformations, Spark recomputes the entire dependency chain. These actions can slow down processing. Break down complex jobs into smaller tasks. If needed, persist the intermediate results. This reduces the recomputation overhead and helps you to debug and monitor the performance of each step individually.
Reduce network latency and I/O operations. Reading and writing data to external sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), or Amazon Redshift, can introduce latency, especially with large datasets. Use the AWS Glue built-in connectors. Store data in a format that supports faster reads and writes, such as Parquet or ORC. Turn on Amazon S3 Transfer Acceleration for faster data transfer rates, and use the AWS Glue Data Catalog to optimize metadata retrieval.
Optimize native operations. AWS Glue native operations, including job bookmarks and DynamoDB export connectors, can increase a job's runtime. To check runtimes, use the driver logs to identify when the AWS Glue job started and when the job bookmarking tasks or the export from DynamoDB ended. For DynamoDB, check for messages similar to the following example:
2024-08-24 03:33:37.000Z connections.DynamoExportConnection (DynamoExportConnection.scala:dynamodbexport(129)): Dynamodb Export complete...exported 712948751 item(s) or 4859215204353 byte(s)
Reduce the effect of job bookmarks
- To reduce the number of files scanned, consolidate small files into larger ones.
- Move partitions that are already processed to an archive location.
- Use efficient partition strategies so that AWS Glue can skip entire partitions that aren't changed.
- To filter data early, use pushdown predicates.
- Turn off job bookmarks temporarily if they don't add value to the job process.
Reduce the effect of DynamoDB connectors
- When possible, reduce the volume of data exported.
- To identify issues that cause the export delays, monitor DynamoDB and AWS Glue.
- To speed up export times, optimize DynamoDB table configurations.
- Where feasible, export data outside of the main job, or adjust schedules to avoid bottlenecks.
Related information
Using job parameters in AWS Glue jobs
Monitoring jobs using the Apache Spark web UI
Monitoring for DPU capacity planning