AWS Glue repartition to a certain file size in Python


I have the Python script below, which currently generates several gz files of about 4 MB each in an S3 bucket; that is the default of what AWS Glue creates. I now want it to create multiple files of roughly 100 MB each in the S3 bucket. I tried the logic below in the script, but it did not work and still creates several 4 MB gz files. I don't know how to repartition here to generate files of a certain size.

```

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import datetime


args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

outputbucketname = args['target_BucketName']

timestamp = datetime.datetime.now().strftime("%Y%m%d")
filename = f"tbd{timestamp}"
output_path = f"{outputbucketname}/{filename}"


# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node075257312 = glueContext.create_dynamic_frame.from_catalog(
    database="ardt",
    table_name="_ard_tbd",
    transformation_ctx="AWSGlueDataCatalog_node075257312",
)

# Script generated for node Amazon S3
AmazonS3_node075284688 = glueContext.write_dynamic_frame.from_options(
    frame=AWSGlueDataCatalog_node075257312,
    connection_type="s3",
    format="csv",
    format_options={"separator": "|"},
    connection_options={
        "path": output_path,
        "compression": "gzip",
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "100000000",
    },
    transformation_ctx="AmazonS3_node075284688",
)


job.commit()

```

RahulD
asked 24 days ago
2 Answers

Have you followed the How can I configure an AWS Glue ETL job to output larger files? article? It contains several suggestions for controlling output file sizes.

AWS
EXPERT
answered 24 days ago
  • Yes, I have checked this article. I have already tried increasing the value of the groupSize parameter, but it did not work. I don't want to use coalesce(N), because I have seen in many articles that it affects performance. To be honest, I don't know how to use repartition(N) or the maxRecordsPerFile option. Maybe you can provide some ideas?


The provided example uses groupSize in the from_options method of the write. To clarify, the groupFiles option enables 'the grouping of files within an Amazon S3 data partition' and is specific to reading files from S3. [1][2] It is true that repartitioning data involves a shuffle, but the number of files Spark writes is determined by the number of partitions. You can limit the impact of that shuffle while still controlling the size of the written files by pushing the repartition as close as possible to the write operation, so that the rest of your transformations still parallelize effectively.
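
As a rough sketch against the frame in your script (the partition count of 10 below is only a placeholder; you would derive it from your data volume), that could look like:

```
from awsglue.dynamicframe import DynamicFrame

# Repartition immediately before the write so the earlier transformations
# keep their parallelism. The count of 10 is a placeholder value.
repartitioned_df = AWSGlueDataCatalog_node075257312.toDF().repartition(10)
repartitioned_dyf = DynamicFrame.fromDF(repartitioned_df, glueContext, "repartitioned_dyf")

AmazonS3_node075284688 = glueContext.write_dynamic_frame.from_options(
    frame=repartitioned_dyf,
    connection_type="s3",
    format="csv",
    format_options={"separator": "|"},
    connection_options={"path": output_path, "compression": "gzip"},
    transformation_ctx="AmazonS3_node075284688",
)
```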

References
[1] https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
[2] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-s3-home.html#aws-glue-programming-etl-connect-s3

AWS
answered 21 days ago
  • I am a bit confused. Could you please provide an example based on the code in my question? I am using the Data Catalog when creating the frame.

  • Absolutely. In your write action, you are passing groupFiles and groupSize in the connection_options parameter. You are reading from a catalog table and writing to an S3 path using from_options. From my understanding, you want to know why groupSize with groupFiles does not control the size of your files on write.

    Searching the AWS documentation, groupFiles is an option for create_dynamic_frame and applies to reading files into groups. The documentation says, 'When you set certain properties, you instruct AWS Glue to group files within an Amazon S3 data partition and set the size of the groups to be read. You can also set these options when reading from an Amazon S3 data store with the create_dynamic_frame.from_options method.' -https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html -https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/optimize-the-etl-ingestion-of-input-file-size-on-aws.html

    To write files of a certain size, you will need to calculate the number of partitions from the size of the data you are processing, using the methods in this article. Note that even if you alter your read operation to group by file size, Spark will spread the write across its default partitions unless you alter this by writing by partitions, or by explicitly setting the number of partitions with coalesce() or repartition(), as sketched below. -https://repost.aws/knowledge-center/glue-job-output-large-files
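
    To make this concrete, here is a rough sketch under some assumptions: the source bucket and prefix are placeholders for wherever your catalog table actually points, the 100 MB target is applied to the data size as stored (if the source objects are compressed, the gzip output will be smaller again), and toDF() is used to move from a DynamicFrame to a DataFrame:

    import math
    import boto3

    # Placeholders -- point these at the S3 location behind your catalog table.
    SOURCE_BUCKET = "my-source-bucket"
    SOURCE_PREFIX = "ardt/_ard_tbd/"
    TARGET_FILE_BYTES = 100 * 1024 * 1024

    # Estimate the total data size by summing the object sizes under the prefix.
    s3 = boto3.client("s3")
    total_bytes = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            total_bytes += obj["Size"]

    num_partitions = max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))

    # Repartition right before the write so each partition becomes roughly one file.
    df = AWSGlueDataCatalog_node075257312.toDF().repartition(num_partitions)

    From there you can either convert back to a DynamicFrame with DynamicFrame.fromDF() and keep your existing write_dynamic_frame call (dropping groupFiles/groupSize from it), or write directly with the DataFrame API. With the DataFrame writer, maxRecordsPerFile caps the rows per output file; the count below is a placeholder you would tune until the files land near 100 MB:

    (df.write
        .option("compression", "gzip")
        .option("maxRecordsPerFile", 1000000)
        .option("sep", "|")
        .csv(output_path))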

  • I have tried the following, from the AWS article you shared:

    AWSGlueDataCatalog_node075257312 = glueContext.create_dynamic_frame.from_catalog(connection_type="s3", format="csv", database="ardt", table_name="_ardt_rw_om__bts", connection_options={ "paths": [output_path], "recurse": True, "groupFiles": 'inPartition', "groupSize": 10485760 }, transformation_ctx="AWSGlueDataCatalog_node075257312")

    I don't know how to use the logic below while writing the frame. Could you please help?

    df.write.option("compression", "gzip").option("maxRecordsPerFile",20).json(s3_path)
