Writing a Glue dynamic frame to S3 is taking too long


Hi, I have a Glue job running with PySpark. It's taking too long to write the dynamic frame to S3: for around 1200 records, the write to S3 alone took around 500 seconds. I have observed that even if the data frame is empty, it still takes the same amount of time to write to S3.

Below are the code snippets:

test1_df = test_df.repartition(1)

invoice_extract_final_dyf = DynamicFrame.fromDF(test1_df, glueContext, "invoice_extract_final_dyf")

glueContext.write_dynamic_frame.from_options(frame=invoice_extract_final_dyf, connection_type="s3", connection_options={"path": destination_path}, format="json")

The conversion in the second line and the write to S3 together consume most of the time. Any help will be appreciated. Let me know if any further details are needed.

asked a year ago, 2267 views
2 Answers

Note that after "repartition(1)" only one core of the cluster can do the remaining work. If you just want to generate a single file, put the repartition as late as possible (just before the write).
Also bear in mind that the write step does not run only the write: it triggers all the work from the source up to that point (e.g. repartitioning, filtering, etc.), so even if no data comes out at the end, Spark still has to do all of that work to get there.
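One way to check where the time actually goes is to force the upstream work with an action first and then time the write separately. The following is only a rough sketch: `timed` and `profile_write` are hypothetical helper names, `df` stands in for something like `test_df`, and `path` for a location such as `destination_path`. It also assumes the data fits comfortably in the cluster's cache, which should hold for about 1200 records.

```python
import time


def timed(label, action):
    """Run a zero-argument callable, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return result, elapsed


def profile_write(df, path):
    """Time the upstream computation and the write of a Spark DataFrame separately."""
    cached = df.cache()
    # count() is an action: it forces every upstream transformation to run
    # and materializes the result in the cache.
    rows, compute_s = timed("upstream compute", cached.count)
    # With the data already cached, this timing mostly reflects the write itself.
    # coalesce(1) is used here instead of repartition(1) to avoid a full shuffle.
    _, write_s = timed(
        "write",
        lambda: cached.coalesce(1).write.mode("overwrite").json(path),
    )
    cached.unpersist()
    return rows, compute_s, write_s
```

If "upstream compute" accounts for most of the 500 seconds, the bottleneck is in the earlier steps of the job (reads, joins, filters) rather than in the S3 write or the DynamicFrame conversion.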

AWS
EXPERT
answered a year ago
  • Thanks for the reply! The three lines above are the last three lines of the Glue job. Do you still have any suggestions on the ordering of these lines?


Then you can't move the repartition down any further (you could move it after the conversion, but I don't think it would make any difference).

AWS
EXPERT
answered a year ago
  • I even tried writing the data frame directly to S3, skipping both the repartitioning and the data frame to dynamic frame conversion, but it still took the same amount of time:

    test_df.write.mode("overwrite").format('json').save(destination_path + '/testing-perf3')
