1回答
- 新しい順
- 投票が多い順
- コメントが多い順
1
In pyspark, to write the same dataframe to multiple locations, you need to have two write statements but the distribution to partitions is the costly action hence the slowness. Efficient way is to copy the output from OUTPUT_LOCATION_1 to OUTPUT_LOCATION_2 outside of pyspark through cp. In spark, you can try to repartition with a specified number(example:5) before writing to see if helps the performance with two write statements.
result.repartition(5).write.partitionBy("col2").mode("append").parquet(f"{OUTPUT_LOCATION_1}/end_date={event_end_date}")
関連するコンテンツ
- AWS公式更新しました 3年前