AWS Glue ETL "Failed to delete key: target_folder/_temporary" caused by S3


A Glue job configured with a maximum capacity of 10 nodes, 1 job run in parallel, and no retries on failure is failing with the error "Failed to delete key: target_folder/_temporary". According to the stacktrace, the issue is that the S3 service starts throttling the Glue requests because of the request volume: "AmazonS3Exception: Please reduce your request rate."

Note: The issue is not with IAM, as the IAM role the Glue job uses has permission to delete objects in S3.

I found a suggestion for this issue on GitHub proposing to reduce the worker count: https://github.com/aws-samples/aws-glue-samples/issues/20

"I've had success reducing the number of workers."

However, I don't think that 10 workers is too many, and I would actually like to increase the worker count to 20 to speed up the ETL.

Has anyone who faced this issue had any success solving it? How would I go about it?

Shortened stacktrace:

py4j.protocol.Py4JJavaError: An error occurred while calling o151.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: target_folder/_temporary
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:665)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
	...
Caused by: java.io.IOException: 1 exceptions thrown from 12 batch deletes
	at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:384)
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
	...
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ...

Part of the Glue ETL Python script (just in case):

datasource0 = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table_name", transformation_ctx="datasource0")

... relationalizing, renaming, etc. Converting from DynamicFrame to PySpark DataFrame and back.

partition_ready = Map.apply(frame=processed_dataframe, f=map_date_partition, transformation_ctx="map_date_partition")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=partition_ready,
    connection_type="s3",
    connection_options={"path": "s3://bucket/target_folder",
                        "partitionKeys": ["year", "month", "day", "hour"]},
    format="parquet",
    transformation_ctx="datasink")
job.commit()
cell
asked 4 years ago · 1687 views
3 Answers

Hi,

Error "AmazonS3Exception: Please reduce your request rate." infers that job is hitting S3 limits.

To mitigate this issue, you can try either of given below methods:

Suggestion to avoid hitting S3 limits in Glue job:

  1. Reduce the number of DPUs. Fewer executors will be launched, so fewer tasks will run in parallel. This may help keep the job under the S3 request limit per second, though the job completion time may increase.

  2. Instead of reducing the DPU count, you can use the DataFrame 'repartition' method to reduce the underlying partition count so that fewer tasks write in parallel. Note that repartition causes a shuffle of the data, and with a large data set the job may run into memory issues. See the sketch after this list.
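
For reference, a minimal sketch of option 2, continuing from the question's script (partition_ready and glueContext come from there; the partition count of 20 is only an example value to tune):

from awsglue.dynamicframe import DynamicFrame

# Convert to a Spark DataFrame, reduce the partition count (20 is just an
# example; tune it for your data volume), then convert back for the Glue writer.
# Note: repartition() triggers a shuffle of the data.
spark_df = partition_ready.toDF().repartition(20)
partition_ready = DynamicFrame.fromDF(spark_df, glueContext, "partition_ready")

datasink = glueContext.write_dynamic_frame.from_options(
    frame=partition_ready,
    connection_type="s3",
    connection_options={"path": "s3://bucket/target_folder",
                        "partitionKeys": ["year", "month", "day", "hour"]},
    format="parquet",
    transformation_ctx="datasink")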

I hope this helps.

Edited by: aronaanc-aws on Jan 20, 2020 10:24 PM

AWS
answered 4 years ago

Option #2 is how I solved the issue prior to your comment, but instead of repartition I used coalesce.
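
For example, a sketch under the same assumptions as the question's script (coalesce only merges existing partitions and avoids a full shuffle, which is why I preferred it; 20 is just an example value):

from awsglue.dynamicframe import DynamicFrame

# coalesce merges existing partitions without a full shuffle, unlike repartition.
spark_df = partition_ready.toDF().coalesce(20)
partition_ready = DynamicFrame.fromDF(spark_df, glueContext, "partition_ready")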

answered 4 years ago
  1. Would enabling S3 Transfer Acceleration help to increase the request limit?

Regarding reducing the number of parallel writes: it seems that it comes down to writing data as bigger objects. Glue uses Spark, which writes the data into a _temporary folder and then renames it. I assume that Glue accesses target_folder/_temporary as a single prefix when writing data to S3, whatever the partition, and only then renames _temporary to a partition name, essentially creating a different "prefix" for S3. I have a feeling that this different prefix is not being utilised, because all the requests for the different partitions still go to the _temporary prefix and thus hit the request limit of 3,500 requests per second.

Logically speaking, increasing the number of partitions that Glue writes to should reduce the number of requests towards any one prefix. However, because of this behaviour and the _temporary folder manipulation, that does not work; instead, one would reduce the number of partitions or the number of parallel writes in order to write the data in bigger chunks, essentially reducing the number of requests towards the single _temporary prefix.
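
One way I could express "bigger chunks per prefix" in the script (this is an assumption on my part, not an official Glue recommendation; the column names match the partitionKeys from my script) is to repartition by the partition columns before writing, so that each year/month/day/hour prefix is written by a single task as one larger object:

from awsglue.dynamicframe import DynamicFrame

# Group rows by the same columns used as partitionKeys so that each
# year/month/day/hour prefix is written as one larger object instead of
# many small ones.
spark_df = partition_ready.toDF().repartition("year", "month", "day", "hour")
partition_ready = DynamicFrame.fromDF(spark_df, glueContext, "partition_ready")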

  1. Is there a way to "pre-create" partitions from Glue job script? Or is there a way of avoiding the _temporary folder creation and instead Glue could write directly into named folders? Or something among those lines?

Edited by: cell on Jan 23, 2020 6:30 AM

cell
answered 4 years ago
