
Glue slow on JSON to Parquet



I have a couple of Glue jobs that convert JSON files from one S3 bucket to Parquet in another. The average JSON file is 1-2 KB; there are 1.6 million files so far, about 1.6 GB in total. The jobs take 3-4 minutes each and run every 10 minutes. How can I improve their execution time? Is keeping this large number of small files in the source S3 bucket an issue? I have thought about moving files older than 30 days to Glacier — would that improve run time?

asked 7 months ago · 233 views
1 Answer

Hi, have you looked at the Spark UI to see where the job is spending most of its time? If you are not processing all the files every 10 minutes, it would help to move already-processed objects to another bucket to reduce S3 listing time, or to partition the data to speed up reading.

Please read the "Handle large number of small files" and "Partition" sections of this link: Glue Best Practices
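As a sketch of the small-file handling mentioned above: AWS Glue's S3 source supports `groupFiles`/`groupSize` connection options that coalesce many small objects into fewer, larger read partitions, which cuts per-file task overhead. The bucket names and paths below are placeholders, not taken from your setup, and this only runs inside a Glue job environment:

```python
# Sketch of a Glue job that groups many small JSON files on read.
# Requires the AWS Glue runtime (awsglue is not a standalone package).
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# groupFiles/groupSize combine small S3 objects into larger in-memory
# groups, reducing task-scheduling and listing overhead.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://source-bucket/json-input/"],  # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # target roughly 128 MB per group
    },
    format="json",
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://target-bucket/parquet-output/"},  # placeholder
    format="parquet",
)
```

With 1-2 KB files, grouping like this should reduce the number of read tasks by several orders of magnitude; combine it with moving processed files out of the source prefix so each run only lists new objects.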


answered 7 months ago
