Hi,
The error "AmazonS3Exception: Please reduce your request rate." indicates that the job is exceeding S3 request rate limits. To avoid hitting S3 limits in a Glue job, you can try either of the methods below:
- Reduce the number of DPUs. Fewer executors will be launched, so fewer tasks will run in parallel. This may keep the job under the per-second S3 request limit, though job completion time may increase.
- Instead of reducing the DPU count, use the DataFrame 'repartition' method to reduce the underlying partition count so that fewer tasks are created. Note that 'repartition' shuffles the data, and with a large dataset the job may run into memory issues.
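As a rough sketch of the second suggestion, one way to pick a partition count is to aim for a target output file size, then repartition (or coalesce) to that count before writing. The helper below and the names (df, output_path) are illustrative, not from the original job:

```python
def target_partitions(total_bytes: int, target_file_bytes: int = 256 * 1024 * 1024) -> int:
    """Rough partition count so each output object is about target_file_bytes."""
    # Ceiling division, with at least one partition.
    return max(1, -(-total_bytes // target_file_bytes))

# In a Glue/PySpark job this would be used roughly as:
#   n = target_partitions(estimated_output_bytes)
#   df.repartition(n).write.parquet(output_path)  # full shuffle, evenly sized parts
#   df.coalesce(n).write.parquet(output_path)     # avoids a full shuffle, parts may skew

print(target_partitions(10 * 1024**3))  # 10 GiB at ~256 MiB per file -> 40
```

Fewer, larger output objects mean fewer PUT requests per second against S3. 'coalesce' only merges existing partitions, so it is cheaper than 'repartition' but can produce uneven partition sizes.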
I hope this helps.
Edited by: aronaanc-aws on Jan 20, 2020 10:24 PM
Option #2 is how I solved the issue before your comment, except that instead of 'repartition' I used 'coalesce'.
- Would enabling S3 Transfer Acceleration help increase the request limit?
Regarding reducing the number of parallel writes: it seems to come down to writing data as bigger objects. Glue uses Spark, which creates a _temporary folder, writes the data into it, and then renames it. I assume Glue accesses target_folder/_temporary as a single endpoint when writing to S3, whatever the partition, and then renames _temporary to a partition name, essentially creating a different "endpoint" in S3. I have a feeling this "different" endpoint is never actually used, since all requests for the different partitions still go to the _temporary endpoint, thus hitting the limit of 3,500 requests per second.
Logically, increasing the number of partitions Glue writes to should reduce the number of requests hitting any one endpoint. However, because of the _temporary folder manipulation this does not work; instead, one would reduce the number of partitions or parallel writes so as to write data in bigger chunks, reducing the number of requests against the single _temporary endpoint.
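The reasoning above can be sketched as simple arithmetic. S3's documented limit is 3,500 PUT/COPY/POST/DELETE requests per second per prefix, so if all writes funnel through one _temporary prefix the whole request rate lands on that one prefix, while spreading the same rate over several prefixes lowers the load each one sees (assuming, hypothetically, an even spread):

```python
# Documented S3 write limit per prefix (PUT/COPY/POST/DELETE requests per second).
S3_PUT_LIMIT_PER_PREFIX = 3500

def requests_per_prefix(total_requests_per_sec: float, num_prefixes: int) -> float:
    """Assumes writes are spread evenly across prefixes; real skew will vary."""
    return total_requests_per_sec / num_prefixes

# All writes funneled through the single _temporary prefix:
print(requests_per_prefix(7000, 1))  # 7000.0 -> over the per-prefix limit
# The same volume spread across 4 partition prefixes:
print(requests_per_prefix(7000, 4))  # 1750.0 -> under the per-prefix limit
```

This is only the idealized arithmetic; whether Spark's commit protocol actually spreads requests this way is exactly the open question raised here.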
- Is there a way to "pre-create" partitions from a Glue job script? Or a way to avoid creating the _temporary folder so that Glue writes directly into the named folders? Or something along those lines?
Edited by: cell on Jan 23, 2020 6:30 AM