How do I configure an AWS Glue ETL job to output larger files?


I want to configure an AWS Glue ETL job to output a small number of large files instead of a large number of small files.

Resolution

Increase the value of the groupSize parameter

When you use dynamic frames and the Amazon Simple Storage Service (Amazon S3) dataset has more than 50,000 files, the AWS Glue ETL job automatically groups files. To create fewer, larger output files, increase the groupSize value. For more information, see Reading input files in larger groups.

In the following example, groupSize is set to 10485760 bytes, or approximately 10 MB:

dyf = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://awsexamplebucket/"], 'groupFiles': 'inPartition', 'groupSize': '10485760'}, format="json")

Note: The groupSize and groupFiles parameters are supported only in the .csv, .ion, .grokLog, .json, and .xml file formats. The .avro, .parquet, and .orc file formats don't support these parameters.
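For context, the following is a minimal sketch of how the grouped read might fit into a complete Glue script. The job boilerplate and the output path (s3://awsexamplebucket/output/) are assumptions for illustration, not part of the original example:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the JSON dataset with grouping turned on so that many small
# input files are combined into groups of approximately 10 MB.
dyf = glueContext.create_dynamic_frame_from_options(
    "s3",
    {
        "paths": ["s3://awsexamplebucket/"],
        "groupFiles": "inPartition",
        "groupSize": "10485760",
    },
    format="json",
)

# Write the result back to Amazon S3 (placeholder output path).
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://awsexamplebucket/output/"},
    format="json",
)

job.commit()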

Use coalesce or repartition

Complete the following steps:

  1. (Optional) Calculate your target number of partitions (targetNumPartitions) based on the input dataset size. To control the size of your output files, divide the total input size by the target output file size. For example, targetNumPartitions = 1 GB / 10 MB = 1,000 MB / 10 MB = 100.
    Note: In the preceding example, the input size is 1 GB, the target output file size is 10 MB, and the target number of partitions is 100.

  2. To check the current number of partitions, run the following command:

    currentNumPartitions = dynamic_frame.getNumPartitions()
  3. To reduce the number of output files, decrease the number of Apache Spark output partitions before you write to Amazon S3. To decrease the number of partitions, use the Spark coalesce function:

    dynamic_frame_with_less_partitions = dynamic_frame.coalesce(targetNumPartitions)

    Note: If targetNumPartitions is too small, then the job might fail because of disk space issues.
    -or-
    Use the Apache Spark repartition function:

    dynamic_frame_with_less_partitions = dynamic_frame.repartition(targetNumPartitions)

    Note: To generate larger files, set targetNumPartitions to a smaller value than currentNumPartitions.

The coalesce and repartition functions might increase your job run time because both functions move data between partitions. However, the repartition function performs a full shuffle of the data, whereas the coalesce function merges existing partitions and avoids a full shuffle.
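Putting the preceding steps together, a minimal sketch might look like the following. The 1 GB input size and 10 MB target file size are the example values from step 1, and the output path is a placeholder:

# Example values from step 1: 1 GB of input, 10 MB target file size.
input_size_mb = 1000
target_file_size_mb = 10
targetNumPartitions = max(1, input_size_mb // target_file_size_mb)  # 100

# Check how many partitions the data currently has.
currentNumPartitions = dynamic_frame.getNumPartitions()

# Coalesce only when it actually reduces the partition count.
if targetNumPartitions < currentNumPartitions:
    dynamic_frame = dynamic_frame.coalesce(targetNumPartitions)

# Write fewer, larger files to Amazon S3 (placeholder output path).
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://awsexamplebucket/output/"},
    format="json",
)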

Use maxRecordsPerFile

Use the maxRecordsPerFile option in the Spark write method to set an upper limit on the record count for each output file. The following example sets the maximum record count to 20:

df.write.option("compression", "gzip").option("maxRecordsPerFile", 20).json(s3_path)

Note: The maxRecordsPerFile option sets an upper limit on the record count for each file. The record count of each file might still be less than the value of maxRecordsPerFile. If you set maxRecordsPerFile to zero or a negative value, then there's no record count limit.
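If you start from a DynamicFrame, one way to apply this option is to convert to a Spark DataFrame first. The following is a minimal sketch; the record limit of 1,000,000 and the output path are example values:

# Convert the DynamicFrame to a Spark DataFrame so that the
# DataFrameWriter options, including maxRecordsPerFile, are available.
df = dynamic_frame.toDF()

# Cap each output file at 1,000,000 records and write gzip-compressed
# JSON to a placeholder S3 path.
(
    df.write
    .option("compression", "gzip")
    .option("maxRecordsPerFile", 1000000)
    .mode("overwrite")
    .json("s3://awsexamplebucket/output/")
)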

Related information

Fix the processing of multiple files using grouping
