1 Answer
In PySpark, writing the same DataFrame to multiple locations takes two separate write statements, and distributing the data into partitions is the expensive step in each write, which explains the slowness. A more efficient approach is to write once to OUTPUT_LOCATION_1 and then copy that output to OUTPUT_LOCATION_2 outside of PySpark with a copy command such as cp. If you keep both write statements in Spark, try repartitioning to a fixed number of partitions (for example, 5) before writing and check whether that improves performance:
result.repartition(5).write.partitionBy("col2").mode("append").parquet(f"{OUTPUT_LOCATION_1}/end_date={event_end_date}")
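The copy-outside-of-Spark suggestion could look roughly like the sketch below. It assumes both output locations are paths on a local or mounted filesystem; the helper name `copy_output` is hypothetical and not part of PySpark. For an object store such as S3, `shutil` would not apply and the storage service's own copy tool would be needed instead.

```python
import shutil
from pathlib import Path


def copy_output(src: str, dst: str) -> int:
    """Copy the directory tree at src into dst.

    Returns the number of files present under dst after the copy.
    Hypothetical helper: assumes src and dst are filesystem paths.
    """
    src_path, dst_path = Path(src), Path(dst)
    # dirs_exist_ok=True (Python 3.8+) lets a later run add new partition
    # directories to an existing destination instead of raising an error.
    shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    return sum(1 for p in dst_path.rglob("*") if p.is_file())
```

After the single Spark write to OUTPUT_LOCATION_1 finishes, a call such as `copy_output(f"{OUTPUT_LOCATION_1}/end_date={event_end_date}", f"{OUTPUT_LOCATION_2}/end_date={event_end_date}")` would duplicate the partitioned Parquet files without running the second write statement.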