Glue Job overwrite target files


I have a Glue job that processes data from a source S3 bucket into a target S3 bucket. Each run creates new files containing the same data, so I end up with duplicates in the result table. Is there a way to overwrite the target S3 files on each run using the Glue job visual editor? Yes, I could switch the job to 'script' mode and edit it manually, but that is a one-way road: afterwards I lose all the benefits of visual job editing.

  • If you are just writing files to S3, you should still be able to overwrite data in an S3 path using Spark. If not, you could use boto3 and Python to delete the data under the S3 path before writing (as a last resort).
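The boto3 route mentioned in the comment above can be sketched roughly as follows. This is a minimal sketch, not part of any Glue API: `split_s3_uri` and `purge_s3_prefix` are illustrative helper names I am introducing here, and the deletion wipes everything under the prefix, so use it only on a path your job fully owns.

```python
from urllib.parse import urlparse

def split_s3_uri(s3_uri):
    """Split 's3://bucket/prefix/...' into (bucket, prefix)."""
    parsed = urlparse(s3_uri)
    return parsed.netloc, parsed.path.lstrip("/")

def purge_s3_prefix(s3_uri):
    """Delete every object under the prefix before the job writes new files."""
    # boto3 is available in the Glue job's Python environment
    import boto3
    bucket, prefix = split_s3_uri(s3_uri)
    s3 = boto3.resource("s3")
    # batch-deletes all objects whose key starts with the prefix
    s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
```

Calling `purge_s3_prefix(target_s3_path)` at the start of each run would keep the target path free of files from previous runs, at the cost of a short window where the path is empty.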

asked 10 months ago · 768 views
1 Answer

Are you writing to S3 directly, or to a table stored in S3? Are you using Spark on Glue?

I prefer to use a util function that, in part, includes the following logic:

# default param values:
#     write_format: str = "parquet",
#     write_compression: str = "snappy",
#     write_mode: str = "overwrite",
#     partition_cols: Union[str, List[str]] = None,

    # `table_exists` and `logger` are helpers defined elsewhere in the util.
    # If the table does not exist, provide the S3 path when saving so the
    # table is created dynamically as an external table.
    if not table_exists(spark_session=spark_session, db_table=target_db_table):
        if partition_cols:
            logger.info("    - partitioned by %s" % partition_cols)
            (df.write.option("compression", write_compression)
                .option("path", target_s3_path)
                .partitionBy(partition_cols)
                .format(write_format)
                .mode(write_mode)
                .saveAsTable(target_db_table))
    # Otherwise, dynamically overwrite existing partition value combinations and append new partitions
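The "otherwise" branch relies on Spark's dynamic partition overwrite: only the partition value combinations present in the incoming DataFrame are replaced, while all other partitions are left untouched. A minimal sketch, assuming a `SparkSession` named `spark` (e.g. obtained from the GlueContext), an existing partitioned table, and Spark 2.3+ (no runnable test is included, since this fragment needs a live Spark session):

```python
# Replace only the partitions present in `df`; keep every other partition.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Note: insertInto matches columns by position, not by name, so the
# DataFrame's column order must match the table's schema.
(df.write
   .mode("overwrite")
   .insertInto(target_db_table))
```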
answered 10 months ago
