How do I optimize my AWS Glue ETL workloads when reading from or writing to Amazon DynamoDB?
I want to optimize my AWS Glue extract, transform, and load (ETL) job for reading from or writing to Amazon DynamoDB. -or- My AWS Glue ETL job causes a throttling exception for my DynamoDB table.
Before you create an AWS Glue ETL job to read from or write to a DynamoDB table, consider the following configuration updates. These updates can help optimize the use of resources in AWS Glue and DynamoDB.
Read from DynamoDB
- dynamodb.throughput.read.percent: This configuration variable indicates the percentage of read capacity units (RCU) to be used. The default value of this variable is set to 0.5. Acceptable values range from 0.1 to 1.5, inclusive. A value of 0.5 indicates that the AWS Glue job attempts to consume half the read capacity of the table. The actual read rate varies depending on factors, such as whether there is a uniform key distribution in the DynamoDB table.
Use this example to calculate the approximate run time of your job: Suppose that you provisioned 100 RCUs for your DynamoDB table. With 1 RCU, you can read 4 KB of data. With 100 RCUs, you can perform 100 reads of 409,600 bytes per second. Suppose that your table has 20 GB (21,474,836,480 bytes) of data, and you have set the value of dynamodb.throughput.read.percent to 1.0. This means that your job performs a full table scan with 100% of RCUs.
Then, you can calculate the approximate run time of your job as follows:
Size of your table / Bytes read per second = 21,474,836,480 / 409,600 = 52,429 seconds = 14.56 hours
To reduce the run time of your job, you can increase the number of RCUs by setting the appropriate value for dynamodb.throughput.read.percent. For more information, see Read capacity.
If your DynamoDB table is large, then choose the On-demand read/write capacity mode for your table instead of Provisioned mode. You can choose the on-demand read/write capacity mode when creating a new table or update this setting in the Capacity tab for existing tables. For more information, see Read/Write capacity mode.
- dynamodb.splits: This connection option defines the number of splits the table is partitioned into while reading the data. The default value is set to 1. Acceptable values range from 1 to 1,000,0000, inclusive. The value of 1 represents that there is no parallelism. It's a best practice to set a higher value for this variable. The amount of parallelism that can achieved depends on the AWS Glue worker type and the number of workers configured for the job. For calculating numSlots that can be used as the value for dynamodb.splits, use the formula under "dynamodb.splits" in "connectionType": "dynamodb" as source.
Write into DynamoDB
- dynamodb.throughput.write.percent: This configuration variable defines the percentage of write capacity units (WCU) to use. The default value is set to 0.5. Acceptable values range from 0.1 to 1.5, inclusive. A value of 0.5 means that AWS Glue attempts to consume half of the write capacity of the table. The actual write rate varies depending on factors, such as whether there is a uniform key distribution in the DynamoDB table. Update the value of this variable based on your use case.
If your DynamoDB table is large, then it's a best practice to choose the On-demand read/write capacity mode for your table instead of Provisioned mode.
For more information, see "dynamodb.throughput.write.percent" under "connectionType": "dynamodb" as Sink.
- dynamodb.output.numParallelTasks: This connection option defines the number of parallel tasks that write into DynamoDB at the same time. Note that this is an optional parameter. If you don't specify this parameter, then the numParallelTasks value is calculated based on numPartitions, numSlots, and numExecutors. This value can be used as the value for dynamodb.output.numParallelTasks. For the detailed formula, see "dynamodb.output.numParallelTasks" under "connectionType": "dynamodb" as Sink.
For code examples that show how to read from and write to DynamoDB tables, see Code examples.
- ¿Cómo puedo resolver el error «No queda espacio en el dispositivo» en un trabajo de ETL de AWS Glue?OFICIAL DE AWSActualizada hace un año
- ¿Por qué mi trabajo de ETL de AWS Glue reprocesa los datos aunque haya habilitado marcadores de trabajo?OFICIAL DE AWSActualizada hace un año
- OFICIAL DE AWSActualizada hace 3 años