Optimizing AWS Glue Job to Achieve 20-Second Data Transfer from On-Prem SQL Server to AWS RDS


Hi AWS Experts,

I am working on an AWS Glue job to transfer data from an on-premises SQL Server to AWS RDS (SQL Server). My goal is to achieve a 20-second execution time, but despite multiple optimizations, I haven't been able to reach that target yet.

Current AWS Glue Job Setup:

  1. AWS Glue Version: 5.0 (Supports Spark 3.5, Python 3)
  2. Worker Type: G.4X
  3. Number of Workers: 6
  4. Source: On-Prem SQL Server
  5. Target: AWS RDS (SQL Server)

Optimizations Implemented So Far: Infrastructure-Level Adjustments:

  1. Upgraded from G.1X to G.4X for better memory and compute power.
  2. Increased the number of workers from 1 to 6 for better parallelism.

Spark Configuration Tweaks:

  1. Shuffle partitions reduced from 200 to 25.
  2. Parallelism increased (spark.default.parallelism = 200).
  3. Adaptive Query Execution (AQE) enabled (spark.sql.adaptive.enabled = true).
  4. Broadcast joins used for small tables to minimize shuffle.
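
For context, this is roughly how the settings above are applied in my job script; the broadcast-join line uses hypothetical DataFrame names, and spark.default.parallelism is passed as a job parameter rather than set at runtime:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import broadcast

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Settings currently in use (values are the ones listed above)
spark.conf.set("spark.sql.shuffle.partitions", "25")   # reduced from the default of 200
spark.conf.set("spark.sql.adaptive.enabled", "true")   # Adaptive Query Execution

# spark.default.parallelism is a static setting, so it is supplied via the
# job's --conf parameter instead:  --conf spark.default.parallelism=200

# Broadcast join for a small lookup table (fact_df / dim_df are hypothetical)
# joined_df = fact_df.join(broadcast(dim_df), "key")
```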

Data Processing & Query Optimization:

  1. Applied filtering at the SQL source level to reduce data transfer (ChangeDate >= current_timestamp() - INTERVAL 24 HOURS).
  2. Used Spark distributed processing for JSON conversion.
  3. JDBC Batch Inserts: Increased batch size (5000 records per batch).
  4. Used ThreadPoolExecutor for parallel stored procedure execution in RDS.
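
For reference, the write path looks roughly like this; the endpoint, credentials, table and stored-procedure names are placeholders, filtered_df stands for the prepared DataFrame, and pymssql is just one example driver for the procedure calls:

```python
from concurrent.futures import ThreadPoolExecutor
import pymssql  # example driver; any SQL Server client with callproc would work

# Batched JDBC insert into RDS (connection details are placeholders)
(filtered_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://my-rds-endpoint:1433;databaseName=target_db")
    .option("dbtable", "dbo.target_table")
    .option("user", "app_user")
    .option("password", "***")
    .option("batchsize", 5000)   # 5,000 records per batch, as noted above
    .mode("append")
    .save())

# Parallel stored-procedure execution on RDS (procedure names are placeholders)
def run_proc(proc_name):
    conn = pymssql.connect(server="my-rds-endpoint", user="app_user",
                           password="***", database="target_db")
    try:
        cur = conn.cursor()
        cur.callproc(proc_name)
        conn.commit()
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=4) as pool:
    # list() forces iteration so any exception in a worker is raised here
    list(pool.map(run_proc, ["dbo.usp_refresh_a", "dbo.usp_refresh_b"]))
```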

Issues Faced:

  1. JDBC Read from On-Prem SQL Server still takes ~4-6 seconds.
  2. Spark Transformations & Joins take ~50 seconds despite optimizations.
  3. JDBC Insert into AWS RDS takes ~18-20 seconds, limiting performance.
  4. AWS Glue logs show shuffle and stage processing as key bottlenecks.

Questions & Guidance Needed:

  1. How can I further reduce JDBC read time from the on-prem SQL Server?
  2. Are there any additional Spark optimizations to reduce query execution time?
  3. How can I speed up JDBC inserts into AWS RDS? (Connection pooling, partitioning, etc.?)
1 Answer
0

To further optimize your AWS Glue job and achieve the 20-second execution time target, consider the following strategies:

  1. Reducing JDBC read time from on-prem SQL Server:

    • Implement parallel reads by partitioning your data. Use a partitioning column (e.g., ID or timestamp) and set lower and upper bounds for each partition (a sketch follows after this list).
    • Adjust the 'fetchsize' parameter to optimize the amount of data fetched in each network round trip.
    • If possible, create indexes on frequently queried columns in your source database.
  2. Additional Spark optimizations to reduce query execution time:

    • Fine-tune the shuffle partition count. Instead of dropping all the way to 25, try setting it close to the total number of executor cores (e.g., spark.sql.shuffle.partitions = 80 for 6 G.4X workers: 5 executors × 16 vCPUs each, with one worker acting as the driver).
    • Enable dynamic partition pruning (spark.sql.dynamicPartitionPruning.enabled = true) to reduce data scanned during joins.
    • Use appropriate compression codecs (e.g., spark.sql.parquet.compression.codec = "snappy") for intermediate data.
    • Consider using Spark's cache() or persist() methods for frequently accessed DataFrames.
  3. Speeding up JDBC inserts into AWS RDS:

    • Implement connection pooling to reduce the overhead of creating new connections.
    • Use bulk insert operations instead of individual inserts.
    • Consider writing data to S3 first, then using AWS RDS's bulk import features for faster data loading.
    • If possible, disable autocommit and manually control transactions for better performance.
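
To make the first point concrete, here is a minimal sketch of a partitioned, parallel JDBC read with a tuned fetch size; the connection details, partition column, bounds and partition count are placeholders you would adapt to your source table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # in a Glue job, the session created by GlueContext

source_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://onprem-host:1433;databaseName=source_db")
    .option("dbtable", "dbo.source_table")
    .option("user", "reader")
    .option("password", "***")
    .option("partitionColumn", "Id")   # numeric or date/timestamp column to split the read on
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")      # 8 concurrent connections/queries against the source
    .option("fetchsize", "10000")      # rows fetched per network round trip
    .load())
```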

Additional general optimizations:

  • You are already on AWS Glue 5.0, the latest version, which includes performance improvements and bug fixes, so no upgrade is needed there.
  • Monitor and tune the number of executors and executor memory based on your job's requirements.
  • Optimize your Spark SQL queries by pushing down predicates and projections to reduce data transfer (see the pushdown sketch after this list).
  • Use efficient data formats like Parquet for intermediate data storage.
  • Consider using AWS Glue DynamicFrames instead of Spark DataFrames for better integration with AWS services.
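
As an illustration of predicate and projection pushdown over JDBC (third bullet above), the filter and column list can be embedded in the query the source database executes, so only the needed rows and columns cross the network; the table and column names are placeholders, and spark is the same session as in the previous sketch:

```python
# Derived-table subquery pushed down to SQL Server (the alias is required)
pushdown_query = """
    (SELECT Id, ChangeDate, Payload
     FROM dbo.source_table
     WHERE ChangeDate >= DATEADD(HOUR, -24, SYSUTCDATETIME())) AS recent_changes
"""

recent_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://onprem-host:1433;databaseName=source_db")
    .option("dbtable", pushdown_query)
    .option("user", "reader")
    .option("password", "***")
    .option("fetchsize", "10000")
    .load())
```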

Remember to monitor your job's performance using AWS Glue's built-in profiling tools and Apache Spark's web UI. This will help you identify specific bottlenecks and focus your optimization efforts. Also, consider breaking down your job into smaller, more manageable tasks if the 20-second target proves unattainable for the entire process.
Sources
Improving performance for AWS Glue for Apache Spark jobs - AWS Glue
Best practices for performance tuning AWS Glue for Apache Spark jobs
AWS Glue job takes very long time to write a spark dataframe on S3 | AWS re:Post
Optimize shuffles

answered 2 months ago
