I want to migrate my non-data lake tables to Apache Iceberg tables.
Short description
A robust data lake architecture allows you to consolidate your various data sources into a single location for easy access. With Apache Iceberg tables, you can then use the data management tools and ACID transactions of a traditional data warehouse system at a lower cost.
Resolution
Iceberg tables support two kinds of migration: in-place migration and shadow migration. For more information, see in-place migration and shadow migration on the Iceberg website.
Initialize an Apache Spark session with the Iceberg configuration
Create a new notebook in AWS Glue. Make sure that you use an AWS Identity and Access Management (IAM) role with the correct permissions when you create the notebook.
Complete the following steps:
- Log in to the AWS Glue console.
- Create a Glue notebook and configure the following parameters:
%glue_version 4.0
%idle_timeout 60
%number_of_workers 5
%worker_type G.1X
%%configure
{
"--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"--datalake-formats": "iceberg"
}
- Run the following code to initialize the Spark session with the Iceberg configuration:
from pyspark.sql import SparkSession

# Register an Iceberg catalog named "glue_catalog" that uses the AWS Glue Data Catalog
# for table metadata and Amazon S3 for table storage. Replace bucket_name and prefix
# with your own values.
spark = SparkSession.builder \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://bucket_name/prefix/") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()
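To confirm that the session can reach the AWS Glue Data Catalog through the new catalog, you can run an optional check. The following line is a minimal sketch that isn't part of the original steps, and it assumes only the glue_catalog name configured in the preceding code:
# Optional: list the databases that are visible through the Iceberg catalog.
spark.sql("SHOW NAMESPACES IN glue_catalog").show(truncate=False)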
Use in-place migration to move non-data lake tables
In-place migration leaves the data files of the source table as-is and adds them to the new Iceberg table. The process creates only the metadata for the target table. If errors occur during the migration, then you need to rewrite only the metadata, not the data files themselves.
To migrate the table, complete the following steps:
Note: Use add_files for Parquet data.
- Create an empty Iceberg table with the same structure as the source table:
query = """
CREATE TABLE IF NOT EXISTS glue_catalog.dbname.target_iceberg_tablename
USING iceberg
LOCATION 's3://bucket_name/prefix/'
AS SELECT * FROM dbname.source_tablename LIMIT 0
"""
spark.sql(query)
- Run the add_files procedure:
query = """
CALL glue_catalog.system.add_files(
    table => 'dbname.target_iceberg_tablename',
    source_table => 'dbname.source_tablename'
)
"""
spark.sql(query).show(truncate=False)
- Check that the Iceberg table still points to the data files in the source location. Only the metadata files that were created in the new location belong to the Iceberg table:
query = """
SELECT file_path FROM glue_catalog.dbname.target_iceberg_tablename.files
"""
spark.sql(query).show(10, truncate=False)
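After add_files completes, you can optionally confirm that the source and target tables return the same number of rows. The following check is a minimal sketch that isn't part of the original steps, and it assumes the placeholder table names used in the preceding steps:
# Optional validation: both tables should report the same row count because
# add_files registers the existing data files instead of copying them.
src_count = spark.table("dbname.source_tablename").count()
tgt_count = spark.table("glue_catalog.dbname.target_iceberg_tablename").count()
print(f"source rows: {src_count}, target rows: {tgt_count}")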
Use shadow migration to move non-data lake tables
The shadow migration process restates the data of the source table. You can validate the data during the migration, and you can test different configurations because you create a new table for each configuration.
To avoid collisions, this type of migration places the target data files in a different location from the source data files. The process also allows you to roll back and delete the target table so that you can recover quickly from any issues.
To migrate the table, complete the following steps:
- Create an Iceberg table with the data from the existing table:
query = """
CREATE TABLE IF NOT EXISTS glue_catalog.dbname.target_iceberg_tablename
USING iceberg
LOCATION 's3://bucket_name/prefix/'
AS SELECT * FROM dbname.source_tablename
"""
spark.sql(query).show(truncate=False)
- Check that the data files were migrated to the new location. The metadata files that were created in the new location belong to the Iceberg table:
query = """
SELECT file_path FROM glue_catalog.dbname.target_iceberg_tablename.files
"""
spark.sql(query).show(10, truncate=False)
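If a validation fails, then you can delete the target table and rerun the migration, as noted earlier. The following cleanup is a minimal sketch that isn't part of the original steps; the PURGE clause and the placeholder table name are assumptions based on Iceberg's Spark DDL support:
# Optional cleanup: drop the target Iceberg table after a failed validation.
# PURGE also deletes the copied data files and leaves the source table untouched.
spark.sql("DROP TABLE IF EXISTS glue_catalog.dbname.target_iceberg_tablename PURGE")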
Related information
Using the Iceberg framework in AWS Glue