Glue ETL generating too many files in S3

Hi team, can I ask why Glue is generating so many Parquet files from my ETL job?

AWS (EXPERT)
asked 8 months ago, 338 views

2 Answers

The number of output files corresponds to the number of partitions Spark is processing in your pipeline. You could look at settings such as spark.sql.shuffle.partitions, or you could repartition your DataFrame to reduce the number of partitions before writing.
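As a rough illustration of the repartitioning idea, here is a minimal sketch that assumes a plain Spark DataFrame and no partitioned write, so each partition produces roughly one file. The function names, the 128 MiB target file size, and the example S3 path are hypothetical and not taken from the original job; only coalesce, repartition, rdd.getNumPartitions and write.mode(...).parquet(...) are standard DataFrame calls.

```python
import math


def target_partition_count(total_bytes: int,
                           target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Hypothetical helper: estimate how many output files to aim for.

    The 128 MiB default is an assumption, not a Glue or Spark setting.
    """
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_parquet_with_n_files(df, output_path: str, num_files: int) -> None:
    """Write a Spark DataFrame as roughly `num_files` Parquet files.

    `df` is expected to be a pyspark.sql.DataFrame. Spark writes about one
    file per non-empty partition, so the partition count is adjusted first.
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")

    current = df.rdd.getNumPartitions()
    if num_files < current:
        # coalesce merges existing partitions and avoids a full shuffle.
        adjusted = df.coalesce(num_files)
    else:
        # repartition does a full shuffle; needed to increase the count.
        adjusted = df.repartition(num_files)

    adjusted.write.mode("overwrite").parquet(output_path)


# Possible usage inside a job script (path and size are placeholders):
#   n = target_partition_count(total_bytes=10 * 1024 ** 3)
#   write_parquet_with_n_files(df, "s3://example-bucket/output/", n)
#
# Alternatively, the number of partitions produced by shuffles can be
# lowered for the whole session:
#   spark.conf.set("spark.sql.shuffle.partitions", "50")
```

In a Glue script that works with a DynamicFrame, the frame would first need to be converted with toDF() before calling a function like this.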

That being said, you might not want to do this, as it will slow your job down (fewer partitions to parallelize on), and whatever is consuming these files might also be slowed. For example, if you are loading these Parquet files into Redshift, it is generally better to have multiple files so the load can run in parallel. Most consumers will prefer multiple files for the same reason.

tjtoll
answered 8 months ago
AWS (EXPERT)
reviewed 8 months ago

Since you are using a visual job, add the "Autobalance Processing" component before you save. You can enter the number of files in the optional box, but it is better to leave it empty; the component will then optimize performance while producing a reasonable number of files.

AWS (EXPERT)
answered 8 months ago
