Is it optimal to keep one lengthy Glue job script, or split it into sub-modules/multiple files?

A customer is using a large Python script to run a Glue ETL job. They are wondering whether it is optimal to keep one lengthy Glue job script or to split it into sub-modules/multiple files.

I think this depends on the complexity of the Glue ETL job, but in general it is best practice to leverage as much parallel processing as possible, and splitting the code into sub-modules makes it easier to develop and maintain collaboratively. A sketch of the sub-module approach is shown below.
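On the sub-module side, Glue lets you ship additional Python files alongside the main script via the --extra-py-files job parameter, so shared logic can live in its own module. A minimal sketch, assuming a hypothetical transforms.py helper and illustrative S3 paths:

```python
# transforms.py -- a shared helper module, shipped to the job through the
# --extra-py-files job parameter (module and function names are illustrative)
from pyspark.sql import DataFrame, functions as F

def add_ingest_date(df: DataFrame) -> DataFrame:
    """Stamp each row with the ingest date for downstream partitioning."""
    return df.withColumn("ingest_date", F.current_date())


# main_job.py -- the script registered as the Glue job's entry point
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

from transforms import add_ingest_date  # resolved via --extra-py-files

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical S3 locations; replace with the job's real input/output paths.
df = spark.read.option("header", "true").csv("s3://my-bucket/raw/")
add_ingest_date(df).write.mode("overwrite").parquet("s3://my-bucket/curated/")
```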

asked 6 years ago · 1,385 views
1 Answer
Accepted Answer

I would strongly recommend splitting the job if that's an option. A single big job that needs a lot of DPUs is not recommended. In my case, I needed 700 DPUs to convert 14,000 gzipped CSV files of about 500 MB each to Parquet. The best way to do this in Glue turned out to be splitting the work into 14 instances of the same Spark job, with each job instance processing 1,000 files on 50 DPUs. In short: split the job if doable, and if you can't and still need a lot of DPUs, a transient EMR cluster might be the better fit.
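A minimal sketch of that fan-out pattern, assuming the input files are staged under per-batch S3 prefixes; the bucket name, prefix layout, and batch_id parameter are all hypothetical:

```python
# One parameterized Glue job, launched N times with a different --batch_id
# so each run processes only its own slice of the input files.
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "batch_id"])  # batch_id: "0".."13"
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Assumes the 14,000 files were staged under per-batch prefixes,
# e.g. s3://my-bucket/raw/batch=3/*.csv.gz (hypothetical layout).
input_path = f"s3://my-bucket/raw/batch={args['batch_id']}/"
output_path = f"s3://my-bucket/parquet/batch={args['batch_id']}/"

# Spark reads gzipped CSV transparently based on the .gz extension.
df = spark.read.option("header", "true").csv(input_path)
df.write.mode("overwrite").parquet(output_path)
```

Each of the 14 runs would then be started with a different --batch_id value (0 through 13) from the console, CLI, or a scheduler.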

AWS
answered 6 years ago
