Glue Job overwrite target files


I have a Glue job that processes data from a source S3 bucket into a target S3 bucket. Each run creates new files containing the same data, so I end up with duplicates in the result table. Is there a way to overwrite the target S3 files on each run using the Glue job visual editor? Yes, I could switch the job to 'script' mode and edit it manually, but that is a one-way road: afterwards I lose all the benefits of visual job editing.

  • If you are just writing files to S3, you should still be able to overwrite data in an S3 path using Spark. If not, you could use boto3 and Python to delete the data under the S3 path before writing (as a last resort).
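The boto3 route mentioned in the comment above can be sketched roughly as follows. This is a minimal sketch, not part of any Glue API: `split_s3_uri` and `purge_s3_prefix` are illustrative helper names I am introducing here, and the deletion wipes everything under the prefix, so use it only on a path your job fully owns.

```python
from urllib.parse import urlparse

def split_s3_uri(s3_uri):
    """Split 's3://bucket/prefix/...' into (bucket, prefix)."""
    parsed = urlparse(s3_uri)
    return parsed.netloc, parsed.path.lstrip("/")

def purge_s3_prefix(s3_uri):
    """Delete every object under the prefix before the job writes new files."""
    # boto3 is available in the Glue job's Python environment
    import boto3
    bucket, prefix = split_s3_uri(s3_uri)
    s3 = boto3.resource("s3")
    # batch-deletes all objects whose key starts with the prefix
    s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
```

Calling `purge_s3_prefix(target_s3_path)` at the start of each run would keep the target path free of files from previous runs, at the cost of a short window where the path is empty.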

asked 10 months ago · 768 views
1 Answer

Are you writing to S3 directly, or to a table stored in S3? Are you using Spark on Glue?

I prefer to use a util function that, in part, includes the following logic:

# default param values:
#     write_format: str = "parquet",
#     write_compression: str = "snappy",
#     write_mode: str = "overwrite",
#     partition_cols: Union[str, List[str]] = None,

    # `table_exists` and `logger` are helpers defined elsewhere in the util.
    # If the table does not exist, provide the S3 path when saving so the
    # table is created dynamically as an external table.
    if not table_exists(spark_session=spark_session, db_table=target_db_table):
        if partition_cols:
            logger.info("    - partitioned by %s" % partition_cols)
            (df.write.option("compression", write_compression)
                .option("path", target_s3_path)
                .partitionBy(partition_cols)
                .format(write_format)
                .mode(write_mode)
                .saveAsTable(target_db_table))
    # Otherwise, dynamically overwrite existing partition value combinations and append new partitions
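The "otherwise" branch relies on Spark's dynamic partition overwrite: only the partition value combinations present in the incoming DataFrame are replaced, while all other partitions are left untouched. A minimal sketch, assuming a `SparkSession` named `spark` (e.g. obtained from the GlueContext), an existing partitioned table, and Spark 2.3+ (no runnable test is included, since this fragment needs a live Spark session):

```python
# Replace only the partitions present in `df`; keep every other partition.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Note: insertInto matches columns by position, not by name, so the
# DataFrame's column order must match the table's schema.
(df.write
   .mode("overwrite")
   .insertInto(target_db_table))
```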
answered 10 months ago
