Writing a Glue dynamic frame to S3 is taking too long


Hi, I have a Glue job running with PySpark. It is taking too long to write the dynamic frame to S3: for around 1200 records, the write alone took around 500 seconds. I have also observed that even if the data frame is empty, it still takes the same amount of time to write to S3.

Below are the relevant code snippets:

test1_df = test_df.repartition(1)

invoice_extract_final_dyf = DynamicFrame.fromDF(test1_df, glueContext, "invoice_extract_final_dyf")

glueContext.write_dynamic_frame.from_options(frame=invoice_extract_final_dyf, connection_type="s3", connection_options={"path": destination_path}, format="json")

The conversion in the second line and the write to S3 together consume most of the time. Any help would be appreciated. Let me know if any further details are needed.

Asked 1 year ago · Viewed 2319 times
2 Answers

Notice that when you repartition(1), only one core of the cluster can do work instead of all of them. If you just want to produce a single file, put the repartition as late as possible (just before the write).
Also bear in mind that when you run the write, it is not only the write that executes but all the work from the source up to that point (e.g. repartition, filtering, etc.), so even if no data comes out at the end, Spark still has to do all that work to get there.
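For illustration only (not part of the original answer), here is a minimal sketch of how you could check how much of the ~500 seconds is upstream work versus the conversion and S3 write, reusing test_df, glueContext and destination_path from the question:

import time
from awsglue.dynamicframe import DynamicFrame

# Force the upstream work (reads, filters, joins, ...) to run once and keep the result.
t0 = time.time()
test_df = test_df.cache()
row_count = test_df.count()   # count() is an action, so everything before it executes here
print("upstream work: %.1f s for %d rows" % (time.time() - t0, row_count))

# Now only the conversion, the repartition and the S3 write remain.
t0 = time.time()
invoice_extract_final_dyf = DynamicFrame.fromDF(
    test_df.repartition(1),   # single output file, done just before the write
    glueContext,
    "invoice_extract_final_dyf",
)
glueContext.write_dynamic_frame.from_options(
    frame=invoice_extract_final_dyf,
    connection_type="s3",
    connection_options={"path": destination_path},
    format="json",
)
print("conversion + write: %.1f s" % (time.time() - t0))

If the count() already takes most of the time, the bottleneck is the upstream transformations (or the source reads) rather than the write itself.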

AWS
Expert
Answered 1 year ago
  • Thanks for the reply! The above 3 lines are the last 3 lines of the Glue job. Do you still have any suggestions on the ordering of these lines?


Then you can't move the repartition down any further (you could move it after the conversion, but I don't think it will make any difference).
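For what it's worth, a sketch of that alternative ordering, repartitioning after the conversion instead of before it; this assumes the DynamicFrame.repartition(numPartitions) method from the Glue PySpark API and reuses the names from the question:

from awsglue.dynamicframe import DynamicFrame

# Convert first, then bring the output down to a single partition on the DynamicFrame itself.
invoice_extract_final_dyf = DynamicFrame.fromDF(test_df, glueContext, "invoice_extract_final_dyf")
invoice_extract_final_dyf = invoice_extract_final_dyf.repartition(1)

glueContext.write_dynamic_frame.from_options(
    frame=invoice_extract_final_dyf,
    connection_type="s3",
    connection_options={"path": destination_path},
    format="json",
)

Either way, for 1200 records most of the time is expected to come from the work that produces test_df, not from where the repartition sits.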

AWS
Expert
Answered 1 year ago
  • I even tried writing the data frame directly to S3, skipping both the repartitioning and the DataFrame-to-DynamicFrame conversion, but it still took the same amount of time:

    # direct Spark write, bypassing the DynamicFrame conversion
    test_df.write.mode("overwrite").format('json').save(destination_path + '/testing-perf3')
