Apache Iceberg tables in S3, when properly configured, should not drop data values for columns that don't exist in the schema definition. Iceberg schema changes are indeed metadata-only changes, and no data files are altered when you perform a schema update. However, the behavior you're describing suggests there might be a configuration issue or a misunderstanding of how the data was ingested.
To address your specific concerns:
- Iceberg tables should not drop data values for columns that don't exist in the schema definition. The data should still be present in the underlying files.
- There is a way to retain data for arbitrary columns even if they don't exist in the table schema definition yet; this is part of Iceberg's schema evolution capabilities.
- To update the schema definition of an S3 Iceberg table while retaining values for data that has already been sent, you can use Athena's ALTER TABLE ADD COLUMNS statement. This adds the new columns to the schema without modifying existing data files.
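For instance, adding two new columns in Athena might look like this (the table and column names here are placeholders for illustration):

```sql
-- Hypothetical table and column names; this is a metadata-only change,
-- so existing data files are left untouched.
ALTER TABLE my_db.my_iceberg_table ADD COLUMNS (device_type string, session_id string);
```

After the columns exist in the schema, Iceberg will read values for them from data files that contain those fields, and return null where a file predates the column.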
To ensure that data arriving with new columns is recognized and retained, you also need to configure your write operations correctly. Here are some steps to consider:
- Enable schema evolution: set the table property 'write.spark.accept-any-schema' to 'true'. This allows the table to accept writes whose schema doesn't exactly match the table schema.
- Use dynamic schema merging: when writing data with Spark, set the 'merge-schema' write option to 'true'. This allows new columns from the source to be added to the target table.
- Relax schema ordering checks: set 'spark.sql.iceberg.check-ordering' to 'false' to allow more flexible schema evolution when column order differs.
Here's an example of how you might configure these settings:

```sql
ALTER TABLE your_table SET TBLPROPERTIES ('write.spark.accept-any-schema'='true');
```

```python
# When writing data:
df.writeTo("your_table").option("merge-schema", "true").append()
```
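Note that the ordering check is a Spark session configuration rather than a table property, so it is set on the session. In a spark-sql session, a sketch of applying it would be:

```sql
-- Session-level configuration (not a table property);
-- applies only to the current Spark session.
SET spark.sql.iceberg.check-ordering = false;
```

In application code, the equivalent is passing the same key via the Spark session's configuration when it is built.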
If you're still seeing null values for newly added columns after applying these configurations, it's possible that the data wasn't properly ingested with the new columns in the first place. You may need to investigate your data ingestion process to ensure that the new fields are being correctly written to the Iceberg table.
Remember, Iceberg's schema evolution is designed to be flexible and non-destructive. If configured correctly, it should allow you to add new columns and retain data for those columns, even if they weren't in the original schema definition.
Sources
- Evolve Iceberg table schema - Amazon Athena
- Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality | AWS Big Data Blog