How to control how the result of a Glue job is split into files?


When I run a Glue job to create a new representation of some CSV files, I use partitioning (on, say, year and month). But when I look in S3, there is not just one "file" created under the "directory hierarchy" y=1018/m=1; instead, a whole bunch of small files (each about 9 MB) are created.

  1. Can this behaviour be controlled? I.e., could I specify that each of my partitions results in only, say, one file, or specify my own desired size for each file split?

  2. Is this (~9 MB) the optimal file size for use with Athena or Redshift Spectrum? (Coming from Hadoop/HDFS, I am used to going for much larger file sizes, more in the 128 to 256 MB range.)

I am pretty new to Glue/Spark programming, so any suggestions, documentation links or code snippets (preferably Python, as I am not a Scala developer) are warmly appreciated!
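
For context, a hypothetical sketch of roughly the kind of Glue job described above (the database, table, S3 path and "year"/"month" column names are placeholders, not the actual job):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source CSVs via the Glue Data Catalog (names are placeholders).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_csv_table")

# Write partitioned by year and month. By default every Spark task writes
# its own file under each y=.../m=... prefix, which is where the many
# small (~9 MB) objects come from.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-target-bucket/output/",
        "partitionKeys": ["year", "month"],
    },
    format="parquet",
)
```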

asked 6 years ago · 3112 views
1 Answer
Accepted Answer

Hi,

  1. No, not directly. Spark parallelises the processing of your DataFrame by partitioning it, and each partition is written out as a separate CSV file. In theory, if you force your DataFrame to use only n partitions you could "control" the file size, but this is not recommended because repartitioning is a relatively expensive operation. One way to control Spark partitioning is to force a repartition(), which triggers a full reshuffle of the data. Another way to reduce the number of partitions is coalesce(), which can decrease (but not increase) the number of partitions without a full reshuffle; a minimal sketch of both options is included after these two points. For your problem, though, I wouldn't use either of them. Instead, I would look to merge those files at a later stage, after Spark has finished processing. Directly from our documentation: "One remedy to solve your small file problem is to use the S3DistCP utility on Amazon EMR. You can use it to combine smaller files into larger objects. You can also use S3DistCP to move large amounts of data in an optimized fashion from HDFS to Amazon S3, Amazon S3 to Amazon S3, and Amazon S3 to HDFS."

  2. No, ~9 MB is not an optimal file size for Athena; you are right about the 128 MB to 256 MB range. Please have a look at the following links regarding Athena and Redshift Spectrum optimisations.
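
Not part of the original answer, but as a rough illustration of the two options mentioned in point 1: a minimal PySpark sketch, assuming placeholder S3 paths and that the partition columns are literally named "year" and "month".

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the source CSVs directly from S3 (path and schema handling are placeholders).
df = spark.read.csv("s3://my-source-bucket/input/", header=True)

# Option A: coalesce() reduces the number of partitions without a full
# shuffle; with a single partition, one task writes one file per
# year/month directory (at the cost of no parallelism in the write).
df_out = df.coalesce(1)

# Option B: repartition() by the partition columns forces a full
# (expensive) shuffle so the rows for each year/month land in one task:
# df_out = df.repartition("year", "month")

# Write out, partitioned by year and month as in the question.
(df_out.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3://my-target-bucket/output/"))
```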

AWS
Manos_S
answered 6 years ago
