AWS Glue: DDB write_dynamic_frame_from_options fails with requests throttled

0

Hi, I have an AWS Glue job written in Python that reads a DynamoDB table (cross-account) and then attempts to write to another table in the current account. The code for writing to DynamoDB is very simple:

def WriteTable(gc, dyf, tableName):
    gc.write_dynamic_frame_from_options(
        frame=dyf,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": tableName,
            "dynamodb.throughput.write.percent": "1"
        }
    ) 

I ran it on a small test table with 20 million items, about 500 MB in size.

The average item size is 28 bytes; a sample item is below.

The table has a composite primary key: partition key pair + sort key date.

{
 "pair": {
  "S": "AMDANG"
 },
 "date": {
  "N": "20080101"
 },
 "value": {
  "N": "0.005845"
 }
}

The read side works just fine. The job reads the items pretty fast with 3-5 workers and takes a few minutes.

However, the write sink is a disaster. The requests get throttled and it can barely write 100k records a minute. I have already tried all Glue versions and 3-10 workers.

It actually fails fast with retries exhausted. I had to raise "dynamodb.output.retry" to 30-50, because the default of 10 fails the Glue job as soon as it starts writing, with: An error occurred while calling o70.pyWriteDynamicFrame. DynamoDB write exceeds max retry 10
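
For reference, this is roughly what the write call looks like with that workaround applied (just a sketch; the function name and the retry value of 30 are only illustrative, not a recommendation):

def WriteTableWithRetries(gc, dyf, tableName):
    # Same write as above, plus the raised retry count mentioned above.
    gc.write_dynamic_frame_from_options(
        frame=dyf,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": tableName,
            "dynamodb.throughput.write.percent": "1",
            "dynamodb.output.retry": "30"
        }
    )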

[DynamoDB metrics screenshot during the Glue write: lots of Write Throttle Events.]

What could be the problem here? I thought it could be a partition key distribution problem, but the keys should be well distributed: I've looked at the distinct count of pair, and there are a few thousand pairs. The dates are daily sequences over multiple years.

Ref: https://github.com/awslabs/aws-glue-libs/issues/104

asked 2 years ago · 5412 views
2 Answers
3

The issue is that DynamoDB cannot auto-scale fast enough to keep up with the AWS Glue write speed. To avoid DynamoDB ThrottlingException on writes, use the "Provisioned" capacity mode with auto scaling for both read and write. Choose the min/max capacity wisely, and set the target utilization to allow DynamoDB sufficient time to scale. As an example, I ran this blog successfully with the number of Glue job workers set to 25 and provisioned table capacity with auto scaling for DDB set to: Read 5 (min) / 40K (max) at 70% target utilization, Write 5K/20K (min) / 40K (max) at 35% target utilization. Total records/items: 14,092,383.

Essentially, this gives DynamoDB enough time to auto scale.
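
If it helps, here is a rough boto3 sketch of that kind of auto scaling setup (the table name and numbers are placeholders to adjust for your workload; this is not taken from the blog itself):

import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target (min/max are examples).
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/target-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5000,
    MaxCapacity=40000,
)

# Target-tracking policy; a low target utilization leaves headroom while scaling.
autoscaling.put_scaling_policy(
    PolicyName="target-table-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/target-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 35.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)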

AWS
Kunal_G
answered 2 years ago
EXPERT
John_F
reviewed 2 years ago
  • That's interesting, thank you. I left the table in "on-demand" mode as recommended here https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb

    When the DynamoDB table is in on-demand mode, AWS Glue handles the write capacity of the table as 40000. For importing a large table, we recommend switching your DynamoDB table to on-demand mode.

    Isn't DDB supposed to handle 40k WCU pretty effortlessly in on-demand mode?

  • @catenary-dot-cloud Yes, I know that is recommended in our docs. I tried running the above blog with the DDB table set to on-demand and, not surprisingly, it failed. I guess with the release of AWS Glue 3.0 that recommendation needs a re-look. BTW, this load to DDB via AWS Glue will be added to upcoming AWS Glue workshops.

1

While using an on-demand table is recommended, it too must scale to meet your throughput needs. DynamoDB on-demand tables provide 4k WCU out of the box, so your ETL job will initially be capped at 4k WCU and then throttle until DynamoDB makes the necessary changes to accommodate your request throughput. To achieve the best results, you must first "pre-warm" your table; this is easier to do on new tables, but will also work for existing ones (a rough sketch of the steps follows the note below).

  1. Create the table in provisioned mode with no auto scaling, assigning 40k WCU and 40 RCU in the capacity settings.
  2. Allow the table to go from Creating/Updating to Active.
  3. Once Active, change the capacity mode to on-demand.
  4. You can now achieve approx. 40k WCU out of the box, and your ETL job will have the capacity it needs to run with no issues.

*Note that 40k WCU is the soft limit for max table capacity; it can be increased with a request through Service Quotas.
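
A minimal boto3 sketch of those steps, assuming the same pair/date key schema as in the question (the table name and numbers are placeholders):

import boto3

ddb = boto3.client("dynamodb")

# 1. Create the table in provisioned mode with a high WCU (no auto scaling attached).
ddb.create_table(
    TableName="target-table",
    AttributeDefinitions=[
        {"AttributeName": "pair", "AttributeType": "S"},
        {"AttributeName": "date", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "pair", "KeyType": "HASH"},
        {"AttributeName": "date", "KeyType": "RANGE"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 40, "WriteCapacityUnits": 40000},
)

# 2. Wait until the table is Active.
ddb.get_waiter("table_exists").wait(TableName="target-table")

# 3. Switch to on-demand; the table retains roughly the previously provisioned throughput.
ddb.update_table(TableName="target-table", BillingMode="PAY_PER_REQUEST")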

AWS
EXPERT
answered 2 years ago
  • Leeroy, thanks for the additional information. I figured out the capacity part by trial and error; the Glue job started writing at provisioned capacity.

    However, I had to change the composite key.

    Before: a table with 3 mil items. Composite primary key: AAABBB (a currency pair, USDCAD for example) + sort key: yyyymmdd (date). Item: just a number. Avg item size: 28 bytes, table size 500MB.

    There were around 4000 unique partition keys in this 3-mil-item table. Each had about 1000 items, with around 600 sort keys (a 2-year period) per partition key.

    With this schema, DDB was only able to write at about 1.3k WCU, even with provisioned capacity at 5-10k WCU.

    Changing the schema to abcdez#yyyy with sort key mmdd allowed writes at half of the provisioned WCU, and only the schema abcdez#yyyymm with sort key dd allowed writes at the full provisioned capacity (roughly as in the sketch at the end of this comment).

    Why couldn't DDB write at the provisioned WCU with the original partition key? I didn't see anything in the docs along the lines of "if a partition key has too many sort keys, the writes will be slow".
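
    For reference, the reshaping I described looks roughly like this (only a sketch; the column and key names are illustrative, not the exact job code):

    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql import functions as F

    df = dyf.toDF()
    d = F.col("date").cast("string")  # yyyymmdd as a string
    df = (df
          .withColumn("pk", F.concat_ws("#", F.col("pair"), F.substring(d, 1, 6)))  # abcdez#yyyymm
          .withColumn("sk", F.substring(d, 7, 2))                                   # dd
          .drop("pair", "date"))
    reshaped = DynamicFrame.fromDF(df, gc, "reshaped")  # then written as before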

  • It's not that it has too many SKs; it's the fact that a partition has a hard limit of 1000 WCU. Your items are partitioned based on a hashed version of your partition key, meaning each key can only have 1000 WCU. DynamoDB does have the mechanics to split a key's items into multiple partitions, but you will get throttled before that happens.

    My suggestion is to try shuffling your data before writing. Glue basically provides an abstraction over the BatchWriteItem API, wherein you are providing DynamoDB batches of items with the same partition key, which in turn directs traffic to a single partition and hits the 1000 WCU cap. As Glue is distributed in nature, it will not be obvious that this is happening, but in essence you have multiple hot partitions. Hence, I suggest shuffling your data to allow a more even distribution; see the sketch below.
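
    Something along these lines (a rough sketch using the names from your snippet; the repartition count is arbitrary):

    from awsglue.dynamicframe import DynamicFrame

    # Repartition the rows so each output task writes a mix of partition keys
    # instead of long runs of the same key.
    shuffled_df = dyf.toDF().repartition(100)
    shuffled = DynamicFrame.fromDF(shuffled_df, gc, "shuffled")
    gc.write_dynamic_frame_from_options(
        frame=shuffled,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": tableName,
            "dynamodb.throughput.write.percent": "1"
        }
    )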

  • The problem is called a "hot partition". Consider changing the partition key :-) A GUID usually works fine. I think you can still make an index on the AAABBB key; the index will eventually catch up.
