How do I manage and optimize Iceberg tables for efficient data storage and querying?
I want to optimize my Apache Iceberg tables so that I can achieve efficient data storage and querying.
Resolution
Choose the type of Iceberg table for your use case
The resolution that you choose depends on the type of Iceberg table that you set up.
Use a Copy-On-Write (CoW) table for reads
In this type of Iceberg table, any update or delete operation that you make to the records rewrites the corresponding data files in the background. Rewrites slow performance, especially when you run many updates and deletes. Use a CoW table if your use case involves more reads than writes.
To convert an existing Iceberg table to a CoW table, run the following Apache Spark SQL command:
spark.sql("ALTER TABLE <table-name> SET TBLPROPERTIES ('write.delete.mode'='copy-on-write','write.update.mode'='copy-on-write')")
To create a new CoW table, use the table properties 'write.delete.mode'='copy-on-write','write.update.mode'='copy-on-write':
dataFrame.createOrReplaceTempView("tmp_<your_table_name>")

query = f"""
CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
USING iceberg
TBLPROPERTIES ('format-version'='2','write.delete.mode'='copy-on-write','write.update.mode'='copy-on-write')
AS SELECT * FROM tmp_<your_table_name>
"""

spark.sql(query)
Use a Merge-On-Read (MoR) table for writes
When you update or delete records in a MoR table, the operation adds new delete files instead of rewriting the existing data files. The delete files are merged with the data files during a read, which adds read overhead. However, write operations only add new files to the existing files, so writes stay fast. If you write to the table more often than you read from it, select a MoR table.
To use a MoR Iceberg table, run the following Spark SQL command:
spark.sql("ALTER TABLE <table-name> SET TBLPROPERTIES ('write.delete.mode'='merge-on-read','write.update.mode'='merge-on-read')")
Note: As a best practice, to retain better control over the properties, create the Iceberg tables from a Spark engine. You can also use AWS Glue, EMR Spark, or Amazon Athena to create the tables. However, Athena has limited support for table properties and uses only the MoR type of table.
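For example, the following Athena DDL sketch creates an Iceberg table that Athena manages as a MoR table. The column definitions, database name, and Amazon S3 location are placeholders for illustration:

CREATE TABLE <your_database_name>.<your_table_name> (
  id int,
  data string,
  created_at timestamp)
LOCATION 's3://<your-bucket>/<your-table-path>/'
TBLPROPERTIES ('table_type' = 'ICEBERG')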
Optimize Iceberg tables
Query performance on Iceberg tables can degrade as the number of metadata files, delete files, and small data files increases. Use the following approaches to optimize query efficiency and data storage.
Expire snapshots
Iceberg tables maintain snapshots so that you can fetch data from an older state of the table. A snapshot is written for every write operation performed against the table, and the relevant snapshot ID is added to the new metadata file. Over time, the growing number of snapshots increases the size of the metadata file and slows query performance.
To expire snapshots, use the following options:
- Use the expireSnapshots operation in Spark to expire snapshots that are older than a specified timestamp, in parallel for large tables:
SparkActions.get()
    .expireSnapshots(table)
    .expireOlderThan(tsToExpire)
    .execute()
- Or, use the expire_snapshots procedure. For more information, see expire_snapshots on the Iceberg website.
spark.sql("CALL glue_catalog.system.expire_snapshots('databasename.tablename',<timestamp value>)")
Run the preceding code within an AWS Glue job at regular intervals. When you automate snapshot expiration, you can limit the number of data files, keep the metadata file small, and maintain efficient query performance.
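The following is a minimal PySpark sketch of such a scheduled job. The catalog name glue_catalog, the S3 warehouse path, and the 7-day retention window are assumptions that you can adjust. In an AWS Glue job, you can also supply the equivalent Spark configuration through job parameters instead of building the session in code:

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

# Configure the Iceberg extensions and the AWS Glue Data Catalog (the warehouse path is a placeholder)
spark = (SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<your-bucket>/<warehouse-path>/")
    .getOrCreate())

# Expire snapshots older than 7 days (assumed retention window)
cutoff = (datetime.now() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"CALL glue_catalog.system.expire_snapshots('databasename.tablename', TIMESTAMP '{cutoff}')")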
Remove old metadata files
Set the write.metadata.delete-after-commit.enabled table property to True to automatically delete old metadata files after each table commit. You can also set write.metadata.previous-versions-max to manage the number of old metadata files to keep.
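For example, the following Spark SQL command (a sketch with placeholder names and retention count) turns on automatic metadata cleanup and keeps only the five most recent old metadata files:

spark.sql("""
ALTER TABLE glue_catalog.<your_database_name>.<your_table_name>
SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled'='true',
  'write.metadata.previous-versions-max'='5'
)
""")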
Rewrite manifest files
An Iceberg table uses manifest lists and manifest files to track all of its data files. Over time, each snapshot can refer to many manifest files, and this slows down queries. For more information, see manifest lists and manifest files on the Iceberg website.
Use the rewrite manifests procedure to efficiently manage manifest files. For more information, see rewrite_manifests on the Iceberg website.
Run the following Spark SQL query:
spark.sql("CALL glue_catalog.system.rewrite_manifests('databasename.tablename')")
Rewrite data files
Iceberg maintains and tracks all of the table's data files in its metadata. Over time, a large number of accumulated small data files increases the size of the metadata, and unnecessary files referenced in the metadata decrease read efficiency. The rewrite_data_files procedure in Spark compacts data files in parallel to increase read efficiency. For more information, see rewrite_data_files on the Iceberg website.
Run the following Spark SQL command:
spark.sql("CALL glue_catalog.system.rewrite_data_files(table=>'databasename.tablename')")
Use the BINPACK or SORT strategy to rewrite the data files, based on your use case. For more information, see BINPACK and SORT on the Iceberg website.
BINPACK: The cheapest and fastest approach. It combines smaller files into larger files and decreases the total number of output files. The order of the records isn't changed and no data is shuffled. This is the default strategy.
CALL catalog.system.rewrite_data_files(
  table => 'test_table',
  strategy => 'binpack',
  options => map(
    'rewrite-job-order', 'bytes-asc',
    'target-file-size-bytes', '<set-a-file-size>',
    'max-file-group-size-bytes', '<max-group-size>'  -- for example, 10 GB
  )
)
SORT: The SORT strategy sorts the data while it compacts the files. This strategy is useful when you run many aggregate functions that compare the neighboring records (for example, the min or max functions).
CALL catalog_name.system.rewrite_data_files(
  table => 'databasename.tablename',
  strategy => 'sort',
  sort_order => 'id',  -- can be any column
  options => map('rewrite-all', 'true')
)
Remove orphan files
Orphan files are data files that aren't referenced in any of the table's metadata files, such as files left behind by failed or interrupted write jobs. For more information, see remove_orphan_files on the Iceberg website.
To remove orphan files, run the remove_orphan_files command as shown below:
spark.sql("CALL glue_catalog.system.remove_orphan_files(table=>'databasename.tablename')")
Note: As a best practice, run a scheduled job to manage maintenance activities. You can use a single AWS Glue job to run all of the preceding Spark SQL queries, as shown in the following sketch.
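The following minimal sketch of such a combined maintenance job assumes the same glue_catalog session configuration shown earlier, a placeholder table name, and a placeholder snapshot expiration timestamp:

# Run the maintenance procedures in sequence for one table (the table name and timestamp are placeholders)
table = "databasename.tablename"

maintenance_queries = [
    f"CALL glue_catalog.system.expire_snapshots('{table}', TIMESTAMP '2024-01-01 00:00:00')",
    f"CALL glue_catalog.system.rewrite_data_files(table => '{table}')",
    f"CALL glue_catalog.system.rewrite_manifests('{table}')",
    f"CALL glue_catalog.system.remove_orphan_files(table => '{table}')",
]

for query in maintenance_queries:
    spark.sql(query)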