How do I optimize AWS Glue streaming jobs?

I want to optimize my AWS Glue streaming job.

Resolution

To optimize your AWS Glue streaming job, adhere to the following best practices:

  • Use Amazon CloudWatch to monitor AWS Glue streaming job metrics. Verify that the job has enough CPU, memory, and executors to manage the incoming data rate.
  • Make sure that the batch interval is greater than the batch processing time. For instructions on how to reduce the batch processing time, see Reducing the batch processing times on the Apache Spark website.
  • Use the AWS Glue Data Catalog to store and manage your data schema.
  • Use an efficient data serialization format, such as Apache Parquet. For more information about Apache Parquet, see Parquet files on the Apache Spark website.
  • Activate checkpointing, as shown in the first sketch after this list. For more information, see Checkpointing on the Apache Spark website.
  • Use an open table format, such as Apache Iceberg. If the streaming job doesn't use data lake tables, then migrate the tables to Iceberg. Be sure to fine-tune the commits, expire old snapshots, and use compaction to clean up metadata files. For more information about how to maintain Iceberg tables, see Recommended maintenance on the Apache Iceberg website. The second sketch after this list shows these maintenance calls.
  • Activate Auto Scaling, as shown in the third sketch after this list.
    Note: Auto Scaling is available only for AWS Glue version 3.0 or later.
  • Adhere to best practices when you use or create extract, transform, and load (ETL) jobs.
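
The following is a minimal PySpark sketch of a Glue streaming job that activates checkpointing and writes each micro-batch as Parquet. The database, table, and Amazon S3 paths are placeholder names, and the windowSize value is only an illustrative starting point:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the stream through a Data Catalog table. "my_database" and
# "my_kinesis_table" are placeholders for your own catalog entries.
source_df = glue_context.create_data_frame.from_catalog(
    database="my_database",
    table_name="my_kinesis_table",
    additional_options={"startingPosition": "TRIM_HORIZON"},
)

def process_batch(data_frame, batch_id):
    # Persist each micro-batch in Parquet, an efficient columnar format.
    if data_frame.count() > 0:
        data_frame.write.mode("append").parquet(
            "s3://amzn-s3-demo-bucket/streaming-output/"
        )

glue_context.forEachBatch(
    frame=source_df,
    batch_function=process_batch,
    options={
        # Keep the batch interval longer than the batch processing time.
        "windowSize": "100 seconds",
        # The checkpoint location lets the job resume from the last
        # committed position after a restart or failure.
        "checkpointLocation": "s3://amzn-s3-demo-bucket/checkpoints/",
    },
)
job.commit()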
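
For an Iceberg table, the following Spark SQL calls sketch the maintenance tasks described above. The catalog name glue_catalog, the table name, and the retention values are illustrative assumptions, not required settings:

# "spark" is an Iceberg-enabled SparkSession.

# Expire old snapshots so that unused data and metadata files can be removed.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'my_database.my_iceberg_table',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 5)
""")

# Compact the small files that streaming micro-batches tend to create.
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'my_database.my_iceberg_table')
""")

# Rewrite manifests to clean up the table's metadata files.
spark.sql("CALL glue_catalog.system.rewrite_manifests('my_database.my_iceberg_table')")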
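
To turn on Auto Scaling outside the console, one option is the AWS SDK. The following boto3 sketch updates a job so that Glue can scale up to NumberOfWorkers as needed; the job name, role, and script location are placeholders:

import boto3

glue = boto3.client("glue")

# UpdateJob replaces the job definition, so include the required fields.
glue.update_job(
    JobName="my-streaming-job",
    JobUpdate={
        "Role": "arn:aws:iam::111122223333:role/MyGlueJobRole",
        "Command": {
            "Name": "gluestreaming",
            "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/my_job.py",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,  # upper bound that Auto Scaling works within
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    },
)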

Tune Spark streaming

Use the following configurations to tune Spark Streaming (a combined sketch that sets them follows the list):

  • To activate back pressure to control the streaming job's receiving rate and compensate for scheduling delays and processing times, use spark.streaming.backpressure.enabled.
  • To increase the maximum rate that the receivers can receive data, use spark.streaming.receiver.maxRate. Set this value in records per second.
  • To activate write ahead logs (WALs) for receivers, use spark.streaming.receiver.writeAheadLog.enable. You can use WALs to recover data after driver failures.
  • To define a window size that creates a small number of files, use windowSize. For more information, see Sampling input stream for interactive development.
  • To tune shuffle partitions so that the data for wide transformations is shuffled efficiently, use spark.sql.shuffle.partitions.

For more information, see Performance tuning on the Apache Spark website.
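
The following is a minimal sketch that sets the preceding options when you create the Spark context. The values are illustrative starting points, not recommendations for every workload:

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

conf = (
    SparkConf()
    # Let back pressure adapt the receiving rate to the processing speed.
    .set("spark.streaming.backpressure.enabled", "true")
    # Cap each receiver at 10,000 records per second (illustrative).
    .set("spark.streaming.receiver.maxRate", "10000")
    # Write ahead logs make received data recoverable after driver failures.
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    # Match shuffle parallelism to the data volume of wide transformations.
    .set("spark.sql.shuffle.partitions", "50")
)
sc = SparkContext(conf=conf)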

Related information

How do I troubleshoot AWS Glue streaming jobs?

Amazon Kinesis connections

Kafka connections

Developing using a Docker image
