In AWS Glue Interactive Notebooks, the default behavior when saving data via df.write.saveAsTable() is to overwrite the entire table, regardless of the dataframe's contents. How can I enable the Spark setting so that only the table partitions present in the dataframe are overwritten?
Generic example:
- "table1" has the partition column, event_date, with the values, "2023-01-01" and "2023-01-02"
- The spark dataframe, df, has only the event_date value, "2023-01-03"
from awsglue.context import GlueContext
from pyspark.context import SparkContext

spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)

session_name = "example-session"  # placeholder app name
spark_session = (
    glue_context.spark_session.builder.appName(session_name)
    .config("spark.logConf", "true")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)
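For reference, the same property can also be checked and set at runtime on the existing session (a minimal sketch; it assumes the spark_session object created above):
# Inspect the current value; the Spark default is "static"
print(spark_session.conf.get("spark.sql.sources.partitionOverwriteMode", "static"))
# Set the property on the running session instead of only via the builder
spark_session.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")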
(
    df.write.option("compression", "snappy")
    # including the path creates the table, if it does not exist,
    # as external instead of managed
    .option("path", target_s3_path)
    .saveAsTable(
        name=target_db_table,
        mode="overwrite",
        format="parquet",
        partitionBy="event_date",
    )
)
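To check whether the existing partitions survive a write that contains only a new partition, one option is to list the table's partitions afterwards (a sketch, reusing target_db_table from above):
# Expectation: "2023-01-01" and "2023-01-02" should still be listed
# alongside the newly written "2023-01-03"
spark_session.sql(f"SHOW PARTITIONS {target_db_table}").show(truncate=False)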
I tried both methods of setting this property (via the session builder config above and via the writer-level option mentioned in the update below) and get the same issue in a Glue notebook, while simultaneously reading from the table in Athena for testing purposes. The table definition should not be completely recreated and its partitions re-added when "overwriting" a single new partition; in my test, the dataframe contains only a new partition value.
The biggest issue with this behavior is that when a job is long-running, fails while writing data, or is killed, the table can be left unreadable until it is manually recreated.
I also wonder whether this behavior occurs because I am using the Glue Data Catalog as the metastore.
Update: this is not ideal, but I was able to get new partitions to append dynamically, without the table being dropped and recreated, by using the following approach:
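A minimal sketch along those lines (assuming spark.sql.sources.partitionOverwriteMode is set to dynamic as above and the target table already exists; the exact code may differ):
# insertInto writes into the existing table definition instead of recreating it.
# With partitionOverwriteMode=dynamic, overwrite=True replaces only the
# partitions present in df; overwrite=False simply appends the new rows.
# Note: insertInto matches columns by position, so df's column order must match
# the table's, with the partition column (event_date) last.
df.write.insertInto(target_db_table, overwrite=True)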
I don't think ".option("partitionOverwriteMode", "dynamic")" is doing anyway, without specifying the full property name