Is there an optimal way in pyspark to write the same dataframe to multiple locations?


I have a dataframe in PySpark and I want to write the same dataframe to two locations in AWS S3. Currently I have the following code running on AWS EMR.

# result is the name of the dataframe
        
result = result.repartition(repartition_value, 'col1').sortWithinPartitions('col1')

result.write.partitionBy("col2")\
      .mode("append") \
      .parquet(f"{OUTPUT_LOCATION_1}/end_date={event_end_date}")

result.write.partitionBy("col2") \
      .mode("append") \
      .parquet(f"{OUTPUT_LOCATION_2}/processed_date={current_date_str}")

The inclusion of this additional write step has increased the runtime of the job significantly (almost double). Could it be that Spark's lazy evaluation re-runs the same steps for each write?

I have tried caching the data beforehand with result.cache() and forcing an action afterwards, e.g. result.count(), but this hasn't provided any benefit. What would be the most efficient way to write the same dataframe to two outputs?
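For reference, the caching attempt slots in between the repartition and the two writes, roughly like this:

result = result.repartition(repartition_value, 'col1').sortWithinPartitions('col1')

# cache the prepared dataframe and force an action so it is materialized once,
# hoping both subsequent writes can reuse the cached partitions
result.cache()
result.count()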

Asked 2 years ago · 1675 views
1 Answer

In PySpark, writing the same dataframe to multiple locations requires two write statements, and the shuffle of data into partitions is the costly step, hence the slowness. A more efficient approach is to write the output once to OUTPUT_LOCATION_1 and then copy it to OUTPUT_LOCATION_2 outside of PySpark, for example with aws s3 cp --recursive, aws s3 sync, or S3DistCp on EMR. Within Spark, you can also try repartitioning to a fixed number of partitions (for example 5) before writing, to see whether that improves performance with two write statements.

result.repartition(5).write.partitionBy("col2").mode("append") \
      .parquet(f"{OUTPUT_LOCATION_1}/end_date={event_end_date}")
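As a rough sketch of the write-once-then-copy approach described above (assuming the AWS CLI is available on the EMR master node and that OUTPUT_LOCATION_1 / OUTPUT_LOCATION_2 are s3:// URIs):

import subprocess

# write the dataframe once
result.repartition(5).write.partitionBy("col2").mode("append") \
      .parquet(f"{OUTPUT_LOCATION_1}/end_date={event_end_date}")

# then replicate the finished output to the second location outside of Spark
subprocess.run(
    ["aws", "s3", "sync",
     f"{OUTPUT_LOCATION_1}/end_date={event_end_date}",
     f"{OUTPUT_LOCATION_2}/processed_date={current_date_str}"],
    check=True,
)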

Support Engineer
Answered 2 years ago
