Notice that when you call `repartition(1)`, only one core of the cluster can do any work downstream of it. If you just want to generate a single file, put the `repartition` as late as possible, just before the write.
Also bear in mind that the write action triggers not only the write itself but all of the work from the source up to that point (repartitioning, filtering, etc.), so even if little data comes out at the end, Spark still has to do all of that work to get there.
Given that, you can't move the repartition down any further (you could move it after the DynamicFrame conversion, but I don't think it would make any difference).
I even tried writing the data frame directly to S3, skipping both the repartitioning and the DataFrame-to-DynamicFrame conversion, but it still took the same amount of time:
test_df.write.mode("overwrite").format('json').save(destination_path + '/testing-perf3')
Thanks for the reply! The above three lines are the last three lines of the Glue job. Do you still have any suggestions about the ordering of these lines?