How to control how the result of a Glue job is split into files?


When I run a Glue job to create a new representation of some CSV files, I use partitioning (on, say, year and month). But when I look in S3, there is not just one "file" created under the "directory hierarchy" y=2018/m=1; instead, a whole bunch of small files (each about 9 MB) are created.

  1. Can this behaviour be controlled? I.e., could I specify that each of my partitions results in only, say, one file, or specify my own desired size for each file split?

  2. Is this (~9 MB) the optimal file size for use with Athena or Redshift Spectrum? (Coming from Hadoop/HDFS, I am used to going for much larger file sizes, more in the 128 to 256 MB range.)

I am pretty new to Glue/Spark programming, so any suggestions, documentation links or code snippets (preferably Python, as I am not a Scala developer) are warmly appreciated!

asked 6 years ago | 3079 views
1 Answer
Accepted Answer

Hi,

  1. No, not directly. Spark parallelises the processing of your DataFrame by splitting it into partitions, and each partition is written out as a separate file. In theory, if you force your DataFrame to use only n partitions you can "control" the number (and therefore the size) of the output files, but this is not generally recommended because repartition() is a relatively expensive operation: it forces a full reshuffle of the data. An alternative is coalesce(), which can reduce (but not increase) the number of partitions without a full reshuffle (see the PySpark sketch after this list). For your problem, though, I wouldn't use either of these options. Instead, I would merge those files at a later stage, after Spark has finished processing. Directly from our documentation: "One remedy to solve your small file problem is to use the S3DistCP utility on Amazon EMR. You can use it to combine smaller files into larger objects. You can also use S3DistCP to move large amounts of data in an optimized fashion from HDFS to Amazon S3, Amazon S3 to Amazon S3, and Amazon S3 to HDFS."

  2. It is not an optimal file size for Athena. You are right about the 128 MB to 256 MB range. Please have a look at the documentation on Athena and Redshift Spectrum performance optimisation.
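If you do decide to control the output files from the Glue job itself, here is a minimal PySpark sketch of the two approaches mentioned in point 1. It assumes your partition columns are literally called "year" and "month", that you are converting to Parquet, and that the catalog database, table and S3 path ("my_db", "my_csv_table", "s3://my-bucket/output/") are placeholders you would replace with your own values:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source CSV data through the Glue Data Catalog
# (placeholder database/table names).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_csv_table"
)

# Convert to a Spark DataFrame so we can control the partitioning.
df = dyf.toDF()

# Option A: one output file per (year, month) combination.
# repartition() by the partition columns puts all rows for a given
# year/month into a single Spark partition, so each S3 "directory"
# ends up with a single file. This forces a full shuffle.
df_out = df.repartition("year", "month")

# Option B: simply reduce the overall number of output files.
# coalesce() can only decrease the partition count, but avoids a full shuffle.
# df_out = df.coalesce(10)

(
    df_out.write.mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3://my-bucket/output/")  # placeholder bucket/prefix
)

job.commit()
```

Keep in mind that writing exactly one file per partition means a single task writes all the rows for that partition, so very large partitions will be slow to write; in that case several files per partition (or merging afterwards with S3DistCp) is usually the better trade-off.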

AWS
Manos_S
answered 6 years ago
