AWS Glue Dynamic Frame – JDBC Performance Tuning Configuration

3 minute read
Content level: Intermediate

This document lists options for improving JDBC source query performance from an AWS Glue dynamic frame by adding configuration parameters to the ‘from_catalog’ method of the ‘create_dynamic_frame’ function.

Four options are discussed below:

  1. Parallel JDBC Query Execution (Single Table)

  2. Push Down Predicate (Single Table)

  3. Query Push Down (Multi Table)

  4. Glue Studio - Custom Connector

1. Parallel JDBC Query Execution

By default, a JDBC read opens a single connection while fetching data from the source, but the number of parallel reads can be changed with the ‘hashfield’ and ‘hashpartitions’ parameters in the ‘from_catalog’ method, as shown below.

‘hashfield’: Set hashfield to the name of a column in the JDBC table to be used to divide the data into partitions. For best results, this column should have an even distribution of values so the data is spread evenly between partitions. The column can be of any data type.

‘hashpartitions’: Set hashpartitions to the number of parallel reads of the JDBC table. If this property is not set, the default value is 7.

Notes:

  - ‘hashpartitions’ does not work with Oracle databases.

  - ‘hashexpression’ can be used instead of ‘hashfield’.

Code Snippet:

    JDBC_DF = glueContext.create_dynamic_frame.from_catalog(
        database="dms",
        table_name="dms_large_dbo_person",
        transformation_ctx="JDBC_DF",
        additional_options={"hashfield": "last_name", "hashpartitions": "10"},
    )
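As a rough illustration of what ‘hashfield’ and ‘hashpartitions’ do conceptually, the plain-Python sketch below (not the Glue API) hashes the chosen column and assigns each row to one of N partitions, the way Glue splits a table into N parallel reads; the hash function here is only a stand-in for whatever Glue uses internally.

```python
# Illustrative only: simulate splitting a JDBC table into partitions
# by hashing the 'hashfield' column (last_name), as Glue does when
# issuing N parallel queries.
rows = [
    {"id": i, "last_name": name}
    for i, name in enumerate(["smith", "jones", "lee", "patel", "kim", "chen"])
]

hashpartitions = 3  # number of parallel reads

def partition_of(value, num_partitions):
    # Stand-in for the hash Glue applies to the hashfield column.
    return sum(ord(c) for c in str(value)) % num_partitions

# Each partition would correspond to one parallel JDBC query.
partitions = {p: [] for p in range(hashpartitions)}
for row in rows:
    partitions[partition_of(row["last_name"], hashpartitions)].append(row)

for p, subset in partitions.items():
    print(f"partition {p}: {[r['last_name'] for r in subset]}")
```

A column with evenly distributed values keeps these partitions (and therefore the parallel reads) balanced, which is why the article recommends choosing such a column for ‘hashfield’.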

2. Predicate Push Down

The parallel execution option above increases the number of parallel reads; if you also want to filter the data at the source, use the push-down-predicate parameters below.

‘hashexpression’: AWS Glue generates SQL queries to read the JDBC data in parallel, using the hashexpression in the WHERE clause to partition the data.

‘enablePartitioningForSampleQuery’: False by default. Required if you want to use sampleQuery with a partitioned JDBC table. If set to true, sampleQuery must end with "where" or "and" so that AWS Glue can append its partitioning conditions.

‘sampleQuery’: The custom SQL query statement for sampling. By default the sample query is executed by a single executor; if you are reading a large dataset, you may need to enable JDBC partitioning to query the table in parallel.

Code Snippet:

    JDBC_DF_PDP = glueContext.create_dynamic_frame.from_catalog(
        database="dms",
        table_name="dms_large_dbo_person",
        transformation_ctx="JDBC_DF_PDP",
        additional_options={
            "hashexpression": "id",
            "enablePartitioningForSampleQuery": True,
            "sampleQuery": "select * from person where last_name <> 'rb' and",
        },
    )
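The requirement that sampleQuery end with "where" or "and" can be sketched as follows: Glue appends its own hash-based partitioning predicate to the user query, so the query must leave the WHERE clause open. The MOD(...) predicate form below is an assumed illustration, not Glue's actual generated SQL.

```python
# Illustrative only: why sampleQuery must end with "where" or "and"
# when enablePartitioningForSampleQuery is true. Glue appends one
# partitioning condition per partition, producing one query per
# parallel read.
sample_query = "select * from person where last_name <> 'rb' and"
hashexpression = "id"
hashpartitions = 4

# One generated query per partition (assumed predicate shape).
generated = [
    f"{sample_query} MOD({hashexpression}, {hashpartitions}) = {p}"
    for p in range(hashpartitions)
]
for q in generated:
    print(q)
```

If the query did not end with "and", appending the condition would produce invalid SQL, which is why Glue enforces this convention.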

Screenshot 3: Spark UI showing 7 executors. The default partition/executor count of 7 is used here; it can be changed by adding the ‘hashpartitions’ parameter to the dynamic frame script.

3. Query Push Down

Queries spanning multiple tables can be executed with a similar approach to the second option, as shown below.

‘hashexpression’: AWS Glue generates SQL queries to read the JDBC data in parallel, using the hashexpression in the WHERE clause to partition the data. For queries with multiple tables, assign a numeric attribute with a distinct identifier (table_name.column_name, as below).

Code Snippet:

    JDBC_DF_QUERY = glueContext.create_dynamic_frame.from_catalog(
        database="dms_large",
        table_name="dms_large_dbo_sporting_event",
        transformation_ctx="JDBC_DF_QUERY",
        additional_options={
            "hashpartitions": "20",
            "hashfield": "pr.id",
            "hashexpression": "pr.id",
            "enablePartitioningForSampleQuery": True,
            "sampleQuery": "select pr.id, fl.full_name from dms_large.dbo.person pr inner join dms_large.dbo.person_full fl on pr.id = fl.id and",
        },
    )
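Assembling these options by hand is error-prone (a sampleQuery that does not end with "where" or "and" will break partitioning), so a small helper can validate them. The helper name build_pushdown_options and its validation are hypothetical, not part of the Glue API; the table and column names are taken from the article's example.

```python
# Sketch: build the additional_options dict for a multi-table query
# push down, validating the sampleQuery convention up front.
def build_pushdown_options(sample_query, hash_column, partitions):
    # sampleQuery must end with "where" or "and" so Glue can append
    # its partitioning predicate (enablePartitioningForSampleQuery).
    if not sample_query.rstrip().lower().endswith(("where", "and")):
        raise ValueError("sampleQuery must end with 'where' or 'and'")
    return {
        "hashpartitions": str(partitions),  # Glue expects a string here
        "hashexpression": hash_column,
        "enablePartitioningForSampleQuery": True,
        "sampleQuery": sample_query,
    }

opts = build_pushdown_options(
    "select pr.id, fl.full_name "
    "from dms_large.dbo.person pr "
    "inner join dms_large.dbo.person_full fl on pr.id = fl.id and",
    "pr.id",
    20,
)
print(opts)
```

The resulting dict can then be passed as additional_options to create_dynamic_frame.from_catalog, as in the snippet above.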

Note: Please refer to the AWS documentation below for details.

https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html

4. Glue Studio - Custom Connector

The custom-connector-based option can be used to leverage a SQL query against the source; see:

https://docs.aws.amazon.com/glue/latest/ug/connectors-chapter.html

https://aws.amazon.com/blogs/big-data/developing-testing-and-deploying-custom-connectors-for-your-data-stores-with-aws-glue/

AWSRB (AWS Expert)
published 10 months ago
3 Comments

Thanks. I'm using from_options as I have not cataloged the source JDBC tables; I'm reading directly from the source Oracle table. Can you please suggest how to create a Glue dynamic frame using from_options with pushdown predicates/hashfield/hashexpression? I could not find any examples of these in the Amazon documentation. Your inputs and examples would really help many who are working on similar requirements.

Mayura
replied 8 months ago

I have the same question as Mayura posted :)

Jannic
replied 5 months ago
AWS Support Engineer
replied 12 days ago