AWS Glue Interactive Sessions: Query Iceberg and Non-Iceberg Tables


I'm using dbt with the Glue adapter to build our data lake on AWS Glue. I'm not able to query Iceberg and non-Iceberg tables in the same session. These tables are in the same Glue Catalog (same database). I'm able to query non-Iceberg tables and create Iceberg tables with this configuration:

    --conf spark.sql.catalog.glue_catalog.warehouse=<s3>
    --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
    --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
    --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

And I'm able to query Iceberg tables with this configuration:

    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
    --conf spark.sql.catalog.spark_catalog.warehouse=<s3>
    --conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
    --conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

But I'm not able to find a configuration that allows me to query both Iceberg and non-Iceberg tables.

Thanks for any help.

asked a year ago · 971 views
3 Answers
Accepted Answer

Solution:

--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=s3://<bucket>
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Don't use the datalake_formats configuration. On Glue version 5.0, supply these extra JARs: iceberg-aws-bundle-1.7.0.jar and iceberg-spark-runtime-3.3_2.12-1.7.1.jar.

More information: https://github.com/aws-samples/dbt-glue/issues/405
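For dbt users, these settings typically go into the dbt-glue profile rather than on a spark-submit command line. A minimal profiles.yml sketch under the assumption that the adapter's `conf`, `glue_version`, and `--extra-jars` options are used as documented by dbt-glue (account, role, region, bucket, and JAR locations are placeholders):

```yaml
my_datalake:
  target: dev
  outputs:
    dev:
      type: glue
      role_arn: arn:aws:iam::<account_id>:role/<glue_role>   # placeholder
      region: <region>                                       # placeholder
      schema: my_database
      workers: 2
      worker_type: G.1X
      glue_version: "5.0"
      location: s3://<bucket>/warehouse                      # placeholder
      # First setting has no --conf prefix; later ones do (dbt-glue convention).
      conf: >-
        spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
        --conf spark.sql.catalog.spark_catalog.warehouse=s3://<bucket>
        --conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
        --conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
        --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
      default_arguments: "--extra-jars=s3://<bucket>/jars/iceberg-aws-bundle-1.7.0.jar,s3://<bucket>/jars/iceberg-spark-runtime-3.3_2.12-1.7.1.jar"
```

Check the exact key names and conf-string quoting against the dbt-glue documentation for your adapter version before using this.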

answered a year ago
EXPERT
reviewed 10 months ago

I ran the following script to read and write both Iceberg and non-Iceberg tables:

from pyspark.sql import SparkSession

# Initialize the Spark session with Glue catalog for Iceberg and non-Iceberg tables
spark = SparkSession.builder \
    .appName("IcebergAndGlueTables") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue.warehouse", "s3://myicebergdata/warehouse/") \
    .config("spark.sql.catalog.glue.uri", "glue://") \
    .config("spark.hadoop.hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .getOrCreate()
print("Spark session configured for Iceberg and Glue tables.")
try:
    # Step 1: Create an Iceberg table using the Glue catalog
    print("Creating Iceberg table in Glue catalog...")
    spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.default.my_iceberg_table (
        id BIGINT,
        name STRING
    )
    USING iceberg
    """)

    # Step 2: Insert data into Iceberg table
    print("Inserting data into Iceberg table...")
    spark.sql("""
    INSERT INTO glue.default.my_iceberg_table VALUES
        (1, 'Alice'),
        (2, 'Bob')
    """)

    # Step 3: Create a non-Iceberg (Glue-managed Parquet) table
    print("Creating non-Iceberg (Parquet) table in Glue catalog...")
    spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.default.my_parquet_table (
        id BIGINT,
        name STRING
    )
    STORED AS PARQUET
    LOCATION 's3://myicebergdata/tables/my_parquet_table/'
    """)

    # Step 4: Insert data into the non-Iceberg Parquet table
    print("Inserting data into non-Iceberg table...")
    spark.sql("""
    INSERT INTO glue.default.my_parquet_table VALUES
        (3, 'Charlie'),
        (4, 'Diana')
    """)

    # Step 5: Read data from Iceberg table
    print("Reading data from Iceberg table...")
    df_iceberg = spark.sql("SELECT * FROM glue.default.my_iceberg_table")
    df_iceberg.show()

    # Step 6: Read data from non-Iceberg (Parquet) table
    print("Reading data from non-Iceberg table...")
    df_parquet = spark.sql("SELECT * FROM glue.default.my_parquet_table")
    df_parquet.show()

    # Step 7: Combine data from both tables (Iceberg + Parquet)
    print("Combining data from both tables...")
    combined_df = df_iceberg.union(df_parquet)
    combined_df.show()
    print("Spark job completed successfully!")

except Exception as e:
    print(f"An error occurred: {e}")

finally:
    # Stop the Spark session
    spark.stop()

Output:

Spark session configured for Iceberg and Glue tables.
Creating Iceberg table in Glue catalog...
Inserting data into Iceberg table...
Creating non-Iceberg (Parquet) table in Glue catalog...
Inserting data into non-Iceberg table...
Reading data from Iceberg table...
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  1|Alice|
|  2|  Bob|
|  1|Alice|
|  2|  Bob|
+---+-----+

Reading data from non-Iceberg table...
+---+-------+
| id|   name|
+---+-------+
|  3|Charlie|
|  4|  Diana|
+---+-------+

Combining data from both tables...
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  1|  Alice|
|  2|    Bob|
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  Diana|
+---+-------+

Spark job completed successfully!

Spark Session Configuration:

  1. The Spark session is configured to use the AWS Glue Data Catalog as the catalog for both Iceberg and non-Iceberg (Parquet) tables.
  2. The configuration registers spark.sql.catalog.glue to point at the Glue catalog and enables the necessary Iceberg extensions.
  3. Iceberg table creation: the first SQL block in the job above creates an Iceberg table using Glue as the metastore.
  4. Non-Iceberg table creation: the second SQL block creates a Parquet table managed by Glue. Replace the s3:// location with the actual S3 path where you want the data stored.
  5. Inserting data: both INSERT statements run through the same session.
  6. Reading data: both tables are read with plain SELECT statements.
  7. Combining data: the two DataFrames are merged with union.
answered a year ago

To query both Iceberg and non-Iceberg tables in AWS Glue Interactive Sessions using the same configuration, you need to set up your Spark session to support both table types. Here's a configuration that should allow you to query both Iceberg and non-Iceberg tables:

--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=<s3_warehouse_path>
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.type=hive

This configuration combines elements from both of your provided configurations. Here's what each line does:

  1. Sets up the glue_catalog as an Iceberg catalog.
  2. Specifies the S3 warehouse location for Iceberg tables.
  3-4. Configure the Glue catalog implementation and S3 file I/O for Iceberg.
  5. Adds Iceberg extensions to Spark.
  6-7. Configure the default spark_catalog to use a SparkSessionCatalog that can handle both Iceberg and non-Iceberg tables.
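If you prefer to set these options from a script rather than as session magics, the flag list can be turned into key/value pairs for SparkSession.builder.config. A small illustrative helper, not part of the answer above (the parse_conf_args name and the builder loop in the comments are my own sketch):

```python
def parse_conf_args(conf_text):
    """Split a `--conf key=value` flag list into a {key: value} dict."""
    settings = {}
    for chunk in conf_text.split("--conf"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip leading/empty fragments
        key, _, value = chunk.partition("=")
        settings[key.strip()] = value.strip()
    return settings

conf_text = """
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.type=hive
"""

settings = parse_conf_args(conf_text)
# Each pair can then be applied to a session builder, e.g.:
#   builder = SparkSession.builder
#   for k, v in settings.items():
#       builder = builder.config(k, v)
```

(The warehouse line is omitted here only because it contains a placeholder path; add it back with your real S3 location.)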

With this configuration, you should be able to query both Iceberg and non-Iceberg tables in the same Glue Interactive Session. For Iceberg tables, you can use the glue_catalog prefix, and for non-Iceberg tables, you can use the default catalog or specify the spark_catalog prefix.

When using dbt with the Glue adapter, make sure to update your dbt profile to include these configurations. You may need to adjust your dbt models to use the appropriate catalog prefix when referencing tables.

If you're still experiencing issues, double-check that your IAM roles and permissions are correctly set up to access both Iceberg and non-Iceberg tables in your Glue Data Catalog and S3 locations.

Remember that AWS Glue supports Apache Iceberg as one of the open-source frameworks for managing data in a transactionally consistent manner in your Amazon S3-based data lake. This support allows you to work with both traditional and Iceberg tables within the same environment.

Sources
Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter | AWS Big Data Blog
Serverless Data Integration – AWS Glue Features – AWS
Optimizing query performance for Iceberg tables - AWS Glue

answered a year ago
