S3 tables schema evolution and data loss


Let's say we have an S3 table with a specific set of fields defined in its schema. Some time later we send new data with additional fields that do not exist in the schema definition yet, and then we update the S3 table definition with the new fields using Athena SQL. What we expect at this point is to find data in the added column, but instead the newly added column has all null values, which is not expected because Iceberg schema changes are supposed to be metadata-only changes.

Do S3 tables (Apache Iceberg) drop data values for columns that do not exist in the schema definition? Is there a way to retain data for arbitrary columns even if they don't exist in the S3 table schema definition yet? How can I update the schema definition of an S3 table and still retain the values of those columns for data that has already been sent?

Note: the data is being written by a Lambda function used by Firehose.

asked 10 days ago · 65 views
1 Answer

Apache Iceberg tables in S3, when properly configured, should not drop data values for columns that don't exist in the schema definition. Iceberg schema changes are indeed metadata-only changes, and no data files are altered when you perform a schema update. However, the behavior you're describing suggests there might be a configuration issue or a misunderstanding of how the data was ingested.

To address your specific concerns:

  1. Iceberg tables should not drop data values for columns that don't exist in the schema definition. The data should still be present in the underlying files, provided the writer actually included those columns when the files were written.

  2. There is a way to retain data for arbitrary columns even if they don't exist in the table schema definition yet. This is part of Iceberg's schema evolution capabilities.

  3. To update the schema definition of an S3 Iceberg table and retain values for data that has already been sent, you can use Athena's ALTER TABLE ADD COLUMNS statement. This will add the new columns to the schema without modifying existing data files.
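For example, if the new fields were a string and a bigint (the column names below are placeholders for your actual fields), the Athena statement would look something like this:

ALTER TABLE your_database.your_table ADD COLUMNS (new_string_col string, new_count_col bigint)

Because this is a metadata-only operation, existing data files are not rewritten.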

To ensure that new columns are properly recognized and populated with existing data, you need to configure your write operations correctly. Here are some steps to consider:

  1. Enable schema evolution: Set the table property 'write.spark.accept-any-schema' to 'true'. This allows the table to accept data with schemas that don't exactly match the table schema.

  2. Use dynamic schema merging: When writing data with Spark, use the "merge-schema" option set to "true". This allows new columns from the source to be added to the target table.

  3. Disable schema ordering checks: Set 'spark.sql.iceberg.check-ordering' to false to allow more flexible schema evolution. This is a Spark session configuration rather than a table property; see the sketch after the example below.

Here's an example of how you might configure these settings:

ALTER TABLE your_table SET TBLPROPERTIES ('write.spark.accept-any-schema'='true')

-- When writing data:
df.writeTo("your_table").option("merge-schema", "true").append()
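
The ordering check from step 3 is set on the Spark session (or passed as a --conf when the job is submitted) rather than on the table. A minimal PySpark sketch, assuming a session that already has your Iceberg catalog configured:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Step 3: relax the column-ordering check so writes whose field order or
    # set of fields differs from the table schema are not rejected.
    .config("spark.sql.iceberg.check-ordering", "false")
    .getOrCreate()
)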

If you're still seeing null values for newly added columns after applying these configurations, it's possible that the data wasn't properly ingested with the new columns in the first place. You may need to investigate your data ingestion process to ensure that the new fields are being correctly written to the Iceberg table.
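
Since the question notes that the data is written by a Lambda function used by Firehose, one thing to verify is that the function's output records still carry the new keys. Below is a minimal sketch of a Firehose transformation handler; the JSON payload structure and the hypothetical "new_field" key are assumptions about your data:

import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose delivers each record's payload base64-encoded.
        payload = json.loads(base64.b64decode(record["data"]))

        # Pass through every key from the incoming payload, including fields
        # that were not part of the original table schema (e.g. a hypothetical
        # "new_field"). Stripping unknown keys here is a common reason a
        # later-added column only ever shows nulls.
        data_out = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8")

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": data_out,
        })

    return {"records": output}

If the transformation (or any mapping between Firehose and the table) drops keys that are not yet in the schema, those values never reach the data files, and adding the column afterwards can only surface nulls.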

Remember, Iceberg's schema evolution is designed to be flexible and non-destructive. If configured correctly, it should allow you to add new columns and retain data for those columns, even if they weren't in the original schema definition.

Sources
Evolve Iceberg table schema - Amazon Athena
Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality | AWS Big Data Blog

answered 10 days ago
