Have you followed the article "How can I configure an AWS Glue ETL job to output larger files?"? It contains several suggestions for controlling the output file sizes.
The provided example uses groupSize in the from_options method. To clarify, the groupFiles option enables 'the grouping of files within an Amazon S3 data partition'; it applies only to reading files from S3. [1][2] It is true that repartitioning data involves a shuffle, but the number of files Spark writes is determined by the number of partitions. You can limit the impact of that shuffle when controlling the size of written files by pushing the repartition step as close to your write operation as possible, so that the earlier transformations of your data still parallelize effectively (see the sketch after the references below).
References:
[1] https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
[2] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-s3-home.html#aws-glue-programming-etl-connect-s3
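For example, here is a rough sketch of that pattern, reusing the database and table names from your question; the partition count, output bucket/prefix, and node names are placeholders, not tested against your job:

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Read from the Data Catalog as usual (database/table names taken from your question).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="ardt",
    table_name="_ardt_rw_om__bts",
    transformation_ctx="read_from_catalog",
)

# ... apply your transformations here, at full parallelism ...

# Repartition only at the very end, immediately before the write, so the shuffle
# does not constrain the earlier transformations. 10 is a placeholder target for
# the number of output files.
repartitioned = DynamicFrame.fromDF(dyf.toDF().repartition(10), glueContext, "repartitioned")

glueContext.write_dynamic_frame.from_options(
    frame=repartitioned,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/your-prefix/"},  # placeholder output path
    format="csv",
)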
I am a bit confused. Could you please provide an example with respect to my current example in the question? I am using the Data Catalog while creating the frame.
Absolutely. In your write action, you are passing groupFiles and groupSize in the connection_options parameter. You are reading from a catalog table and writing to an S3 path using from_options. From my understanding, you want to know why groupSize with groupFiles does not control the size of your files on write.
Searching the AWS documentation, we see that groupFiles belongs to the create_dynamic_frame options and applies to reading files into groups. The documentation states: 'When you set certain properties, you instruct AWS Glue to group files within an Amazon S3 data partition and set the size of the groups to be read. You can also set these options when reading from an Amazon S3 data store with the create_dynamic_frame.from_options method.'
- https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
- https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/optimize-the-etl-ingestion-of-input-file-size-on-aws.html
To write files of a certain size, you will need to calculate the number of partitions from the size of the data you are processing, using the methods in this article. Note that even if you alter your read operation to group by a file size, Spark will optimize the write across its default partitions unless you change this by writing by partitions, or by otherwise explicitly setting the number of partitions with coalesce() or repartition() (see the sketch below).
- https://repost.aws/knowledge-center/glue-job-output-large-files
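As a rough sketch of that sizing calculation (the total data size and the 128 MB target are assumed placeholder values, and dyf stands for the DynamicFrame read from the catalog, as in the earlier sketch):

import math

# Rough sizing: number of output files ≈ total data size / target file size.
# total_size_bytes is a placeholder; in practice you could estimate it from the
# source table, for example by summing the sizes of the input objects in S3.
total_size_bytes = 10 * 1024 ** 3        # assume roughly 10 GB of data
target_file_bytes = 128 * 1024 ** 2      # aim for roughly 128 MB per output file

num_partitions = max(1, math.ceil(total_size_bytes / target_file_bytes))  # -> 80 here

# Apply it just before the write, as above, so only the final stage is affected.
df_out = dyf.toDF().repartition(num_partitions)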
I have tried the approach from the AWS article which you shared:
AWSGlueDataCatalog_node075257312 = glueContext.create_dynamic_frame.from_catalog(
    connection_type="s3",
    format="csv",
    database="ardt",
    table_name="_ardt_rw_om__bts",
    connection_options={
        "paths": [output_path],
        "recurse": True,
        "groupFiles": 'inPartition',
        "groupSize": 10485760,
    },
    transformation_ctx="AWSGlueDataCatalog_node075257312",
)
I don't know how to use the below logic while writing the frame. Could you please help?
df.write.option("compression", "gzip").option("maxRecordsPerFile",20).json(s3_path)
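For example, one way this snippet could be applied to the DynamicFrame from the catalog read above is to convert it to a Spark DataFrame with toDF() and use the DataFrame writer directly. This is only a rough sketch; the output path and write mode are placeholders, and 20 records per file is far too small for real workloads:

# Convert the DynamicFrame from the catalog read to a Spark DataFrame,
# then write with the DataFrame writer, which understands maxRecordsPerFile.
spark_df = AWSGlueDataCatalog_node075257312.toDF()

(spark_df.write
    .option("compression", "gzip")
    .option("maxRecordsPerFile", 20)   # 20 is tiny; raise it for realistic file sizes
    .mode("overwrite")                 # placeholder write mode
    .json("s3://your-bucket/output-prefix/"))  # placeholder output path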
Yes, I have checked this article. I have already tried increasing the value of the groupSize parameter, but it did not work. I don't want to use coalesce(N), as I have seen in many articles that it will affect performance. To be honest, I don't know how to use repartition(N) or the maxRecordsPerFile option. Maybe you can provide some idea?