Migrating Glue Data Catalog tables to use Apache Iceberg open table format using Athena

7 minute read
Content level: Advanced

Implement transactional capabilities on your S3 datasets and enforce strongly consistent reads/writes with Iceberg

Introduction

Customers are increasingly adopting the data lakehouse architecture, which allows them to unify disparate data sources into a single location for consumption. With this architecture, customers combine the flexibility and low cost of a data lake with the data management capabilities and ACID transactions of a traditional data warehouse. AWS has announced new integrations across our analytics offerings to natively support open table formats like Apache Iceberg, making it easier for customers to adopt the data lakehouse architecture.

In this article, we will look at how to migrate existing tables in the AWS Glue Data Catalog to the Apache Iceberg open table format using Amazon Athena.

Scenarios for Migration

Let’s take a look at some common scenarios in which a customer might consider migrating their existing tables to the Apache Iceberg open table format.

Scenario A

Consider a company that has business requirements to retain data according to local regulations. With Iceberg, the company can maintain compliance by expiring older table snapshots once the retention period has passed. Iceberg's snapshot history also provides an audit trail of data changes over time, which helps meet auditing requirements.
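As an illustration, the snippet below is a minimal sketch of how such a retention policy could be enforced from a Spark session using Iceberg's expire_snapshots procedure; the database name, table name, timestamp, and retention count are hypothetical placeholders, not values from this article.

# hypothetical example: expire snapshots older than the retention threshold so
# that removed data can no longer be read back via time travel
spark.sql("""
CALL spark_catalog.system.expire_snapshots(
table => 'compliance_db.customer_events',
older_than => TIMESTAMP '2024-01-01 00:00:00',
retain_last => 10
)
""").show()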

Scenario B

Consider a team of analysts conducting a root cause analysis on the rate of returns for a product line. With Iceberg, the team can perform historical reporting using time travel queries against earlier snapshots of the table.
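For example, a time travel query from a Spark session could look like the following minimal sketch, assuming a Spark version (3.3 or later) that supports the TIMESTAMP AS OF syntax; the table name and timestamp are hypothetical placeholders. Iceberg also supports querying a specific snapshot ID with VERSION AS OF, and the Athena query editor offers similar time travel clauses for Iceberg tables (FOR TIMESTAMP AS OF and FOR VERSION AS OF).

# hypothetical example: query the table as it existed at a point in time
spark.sql("""
SELECT * FROM returns_db.product_returns
TIMESTAMP AS OF '2024-06-01 00:00:00'
""").show(5)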

Scenario C

Consider a company that has centralized all of its datasets in a data lake. The data lake has multiple producers writing to datasets and multiple consumers reading from them, and teams are seeing delays between when datasets are written and when the updates are reliably visible to consumers. With Iceberg, teams can ensure their data is strongly consistent through ACID transactions and atomic snapshot commits, so readers always see a complete, committed version of the table.
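To make the transactional behavior concrete, the sketch below shows an upsert against an Iceberg table with MERGE INTO, which Iceberg commits as a single atomic snapshot; concurrent readers see either the table before the merge or after it, never a partial update. The database, table, and column names are hypothetical placeholders.

# hypothetical example: upsert changes into an Iceberg table in one atomic commit
spark.sql("""
MERGE INTO lake_db.orders AS target
USING lake_db.orders_updates AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")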

Scenarios to Avoid Migration

Iceberg tables are often best suited for large batch workloads. Iceberg introduces overhead on reads and writes from processing transactions and managing metadata, which may make it unsuitable for latency-sensitive applications. Tables that do not require complex partitioning or transactional capabilities are also generally not worth migrating to Iceberg.

Iceberg Migrations

Two types of migration to Iceberg tables are supported: in-place migration and shadow migration.

In-Place Migration

This process migrates the table by leaving the data files of the source table in place and registering them in the metadata of the new Iceberg table. Only the metadata of the target table is created during migration. This process is less time-consuming and preserves the data lineage of the source table. In addition, any error during the migration simply requires a rewrite of the metadata, not the data files themselves.

The cons of this approach are:

  • If data is written to the source table during migration, this process will need to be repeated to include the new data files in the target table metadata. It may be necessary to enforce write downtime on the source table to avoid repeating the migration.
  • If schema or partition changes are needed, this process will not work, as the data will need to be restated. Refer to the steps for shadow migration instead.

Shadow Migration

This process migrates the table by restating the data of the source table. There are several benefits to this approach. Data can be validated during the migration, and different configurations can be tested by creating a new table for each configuration. Target data files will also be placed in a different location than the source data files to avoid any collisions. If any issues arise during this process, rollback and recovery can be performed by simply deleting the target table.

The cons of this approach are:

  • Storage utilization will be doubled as data files will now exist for both the source and target tables separately. This can be mitigated by retiring the source table once migration is complete.
  • This process can be time-intensive because data is restated. Migration time will depend on the size of the source table.
  • Keeping data in sync between the source and target tables will be difficult. This can be mitigated by enforcing write downtime on the source table during the migration so source data files are not changed.

Solution

Pre-Requisites

  • An existing table registered in the Glue Data Catalog to be used for migration
  • An IAM role attached to the workgroup you will be using for Athena Notebooks
    • This role must have the correct AWS Glue and Amazon S3 permissions to read from the existing table.
    • You will also need to configure write access to your S3 bucket for the destination table.
    • Note: If you’re using AWS Lake Formation to govern your tables, you will need to give your workgroup role read access from within Lake Formation instead of attaching Glue and S3 permissions directly to the role.

Implementation

  1. Head over to the Athena console and create a new notebook with the following Spark properties selected. Ensure you are using the correct workgroup when creating your notebook.
    (Image: Athena notebook session settings)
  2. Add the following property to your Apache Spark properties before creating your notebook:
    Key: spark.sql.catalog.spark_catalog.warehouse
    Value: s3://[YOUR_BUCKET_NAME]/
  3. Verify you have read access to the source table in the Glue Data Catalog. You should see the first 5 records of the source table in your notebook.
# preview the first 5 records of the source table
spark.sql('''
SELECT * FROM [SRC_DB_NAME].[SRC_TABLE_NAME]
''').show(5)
  4. Follow the steps below for in-place migration:
  • Run the following code in your notebook to create a new Iceberg table as the target for the migration:
# create new iceberg table with same schema and different location
# see https://iceberg.apache.org/spec/#schemas-and-data-types for supported data types
spark.sql("""
CREATE TABLE [TARGET_DB_NAME].[TARGET_TABLE_NAME] (
    [SRC_COLUMN1_NAME] [SRC_COLUMN1_TYPE],
    [SRC_COLUMN2_NAME] [SRC_COLUMN2_TYPE],
    ...
)
USING iceberg
LOCATION 's3://[TARGET_BUCKET_NAME]/[TARGET_FOLDER]/'
""")
  • Run the following code in your notebook to start the migration:
# migrate in place by registering the source table's data files with the add_files procedure
spark.sql("""
CALL spark_catalog.system.add_files(
table => '[TARGET_DB_NAME].[TARGET_TABLE_NAME]',
source_table => '[SRC_DB_NAME].[SRC_TABLE_NAME]'
)
""").show()
  5. Follow the steps below for shadow migration:
  • Run the following code in your notebook to start the migration:
# create table with shadow migration
spark.sql("""
CREATE TABLE [TARGET_DB_NAME].[TARGET_TABLE_NAME]
USING iceberg
LOCATION 's3://[YOUR_S3_BUCKET]/[TARGET_FOLDER]/'
AS (SELECT * FROM [SRC_DB_NAME].[SRC_TABLE_NAME])
""")
  • Optional: You can also perform a shadow migration directly in the Athena query editor via the following code:
CREATE TABLE [TARGET_DB_NAME].[TARGET_TABLE_NAME]
WITH (table_type = 'ICEBERG',
      format = 'PARQUET', 
      location = 's3://[YOUR_S3_BUCKET]/[TARGET_FOLDER]/', 
      is_external = false,
      vacuum_min_snapshots_to_keep = 10,  -- optional property
      vacuum_max_snapshot_age_seconds = 604800 -- optional property
   ) 
AS SELECT * FROM [SRC_DB_NAME].[SRC_TABLE_NAME];
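
Regardless of which migration method you choose, it is a good idea to validate the target table before pointing consumers at it. The following is a minimal sketch that compares row counts between the source and target tables from the notebook; substitute your own database and table names as in the steps above.

# hypothetical validation: compare row counts between the source and target tables
src_count = spark.sql("SELECT COUNT(*) AS cnt FROM [SRC_DB_NAME].[SRC_TABLE_NAME]").collect()[0]["cnt"]
tgt_count = spark.sql("SELECT COUNT(*) AS cnt FROM [TARGET_DB_NAME].[TARGET_TABLE_NAME]").collect()[0]["cnt"]
print("source rows:", src_count, "target rows:", tgt_count, "match:", src_count == tgt_count)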

Clean Up

Run the following to delete any new tables that are no longer needed. Note that purge also deletes the table's data files; if the table was created with in-place migration, those data files are shared with the source table, so drop it without purge in that case.

spark.sql('''
DROP TABLE [DB_NAME].[TABLE_NAME] purge;
''')

Conclusion

This article demonstrated how to migrate your existing tables in the Glue Data Catalog to the Apache Iceberg open table format using Athena. We walked through the two supported methods of migration and covered common scenarios in which migrating tables to Iceberg is useful. Migrating tables to Iceberg is not suitable for every use case, especially for smaller workloads and latency-sensitive workloads.
