How to control how the result of a Glue job is split into files?


When I run a Glue job to create a new representation of some CSV files, I use partitioning (on, say, year and month), but when I look in S3 there is not just one "file" created under the "directory hierarchy" y=2018/m=1; instead, a whole bunch of small files (each about 9 MB) are created.

  1. Can this behaviour be controlled? I.e., could I specify that each of my partitions should result in, say, only one file, or specify my own desired size for each file split?

  2. Is this (~9 MB) the optimal file size for use with Athena or Redshift Spectrum? (Coming from Hadoop/HDFS, I am used to going for much larger file sizes, more in the 128 to 256 MB range.)

I am pretty new to Glue/Spark programming, so any suggestions, documentation links or code snippets (preferably Python, as I am not a Scala developer) are warmly appreciated!

Asked 6 years ago · 3,112 views
1 Answer
Accepted Answer

Hi,

  1. No, not directly. Spark parallelises the processing of your DataFrame by using partitioning, and each partition writes a separate CSV file. In theory, if you force your DataFrame to use only n partitions, you could "control" the file size, but this is generally not recommended because repartitioning is a relatively expensive operation. One way to control Spark partitioning is to force a repartition with repartition(), which triggers a full reshuffle of the data. Another way to reduce the number of partitions is coalesce(): it can only reduce (not increase) the number of partitions, but it avoids a full reshuffle (a minimal sketch of both follows after this list). For your problem, though, I wouldn't use either of these options. Instead, I would merge those files at a later stage, after Spark has finished processing. Directly from our documentation: "One remedy to solve your small file problem is to use the S3DistCP utility on Amazon EMR. You can use it to combine smaller files into larger objects. You can also use S3DistCP to move large amounts of data in an optimized fashion from HDFS to Amazon S3, Amazon S3 to Amazon S3, and Amazon S3 to HDFS."

  2. No, ~9 MB is not an optimal file size for Athena; you are right about the 128 MB to 256 MB range. Please have a look at the following links regarding Athena and Redshift Spectrum optimisations.
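
For illustration, here is a minimal PySpark sketch of the repartition()/coalesce() approach described in point 1. The database, table, bucket path and partition column names (y, m) are placeholder assumptions, not details from the original job.

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source CSVs (database/table names here are placeholders).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table"
)

# Work on the underlying Spark DataFrame to control the number of partitions.
df = dyf.toDF()

# Option A: repartition by the partition columns, so each y=.../m=... prefix is
# written by a single task and ends up as (roughly) one file -- full shuffle.
df = df.repartition("y", "m")

# Option B (instead of A): just reduce the number of partitions; coalesce() can
# only decrease the count, but it avoids a full reshuffle.
# df = df.coalesce(8)

out = DynamicFrame.fromDF(df, glue_context, "out")
glue_context.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={
        "path": "s3://example-bucket/output/",
        "partitionKeys": ["y", "m"],
    },
    format="csv",
)
```

Repartitioning by the partition columns addresses the "one file per partition" part of the question, at the cost of the full shuffle mentioned above; coalesce() is the cheaper option when you only need fewer, larger files.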

AWS
Manos_S
Answered 6 years ago
