
Unable to get past dbt run when using dbt Glue with Iceberg + Spark


I have existing Iceberg tables in GovCloud, which I created using AWS Glue ETL PySpark jobs. I have a Glue execution role and a DynamoDB lock table created. My dbt debug command says "All checks have passed", but my dbt run command gives me the error below:

dbt run --select modelname.sql
22:38:08 Running with dbt=1.9.6
22:38:09 Registered adapter: glue=1.9.4
22:38:09 Unable to do partial parsing because saved manifest not found. Starting full parse.
22:38:12 Found 2 models, 2 sources, 520 macros
22:38:12
22:38:12 Concurrency: 1 threads (target='dev')
22:38:12
22:38:15 1 of 1 START sql table model dbname.modelname [RUN]
22:38:59 Glue adapter: Glue returned error` for statement None for code SqlWrapper2.execute('''/* {"app": "dbt", "dbt_version": "1.9.6", "profile_name": "glue_profile", "target_name": "dev", "node_id": "model.dbt_project.modelname"} */

create table
   dbname.modelname
    
using PARQUET

LOCATION 's3://s3path/lakehouse/dbname/modelname'

as

select abc, pqr from dbname.tablename

''', use_arrow=False, location='s3://s3path/lakehouse/'), Py4JJavaError: An error occurred while calling o244.sql.
: java.lang.RuntimeException: java.lang.InstantiationException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:158)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:127)
    at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:196)
    at org.apache.spark.rdd.HadoopRDD.$anonfun$getPartitio`

asked a year ago · 282 views
1 Answer

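This error points to a configuration problem: your dbt setup isn't configured to handle Iceberg tables. The stack trace fails in HadoopRDD.getInputFormat, which suggests Spark is trying to read the source table through the plain Hadoop input-format path rather than through an Iceberg catalog. The generated statement also creates the model with USING PARQUET, not as an Iceberg table.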

When working with Iceberg tables in AWS Glue through DBT, you need specific Spark configurations to enable proper catalog integration. Here's how to fix this:

  1. Make sure the Spark session configuration in your dbt profile registers Iceberg's session catalog, backed by the Glue Data Catalog:
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=s3://your-bucket-path
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

  2. Don't use the datalake_formats configuration alongside these settings, as it might cause conflicts.

  3. Ensure you're using Glue version 5.0 or later and include Iceberg JAR files that match the session's Spark version. Glue 5.0 runs Spark 3.5, so use the 3.5 runtime rather than a 3.3 build, and keep both JARs on the same Iceberg release:

  • iceberg-aws-bundle-1.7.1.jar
  • iceberg-spark-runtime-3.5_2.12-1.7.1.jar

  4. Make sure the table is created in the correct format. Since you're working with Iceberg tables, the generated CREATE TABLE statement should say USING ICEBERG instead of USING PARQUET. dbt generates that statement from the model's configuration, so this is typically controlled by the model's file_format setting (for example, file_format='iceberg') rather than by editing the SQL by hand.

  5. Check that your IAM role has sufficient permissions for both AWS Glue catalog operations and S3 access to your data lake location.
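
As a rough illustration of the Spark configuration step, the sketch below joins the catalog settings into one string of the kind usually placed in a dbt-glue profile's conf field. The helper name build_conf_string is hypothetical. The joining convention (first pair bare, later pairs prefixed with --conf) is an assumption to check against the dbt-glue documentation for your adapter version, and the warehouse path is the placeholder from the answer.

```python
# Hypothetical helper, not part of dbt or dbt-glue: it only assembles a string.
ICEBERG_CONF = {
    "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
    "spark.sql.catalog.spark_catalog.warehouse": "s3://your-bucket-path",  # placeholder
    "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.spark_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
}


def build_conf_string(settings):
    """Join key=value pairs, separating consecutive pairs with ' --conf '.

    Assumption: the profile's conf field expects the first pair without a
    leading --conf and every later pair prefixed with it.
    """
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


conf_string = build_conf_string(ICEBERG_CONF)
print(conf_string)
```

Keeping the settings in a dictionary makes it harder to drop or duplicate a key when the string is edited later. The printed value can be pasted into the profile as a single line.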

By implementing these changes, you should be able to successfully run your DBT models with AWS Glue on Iceberg tables.
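
One quick sanity check before rerunning: the name of an iceberg-spark-runtime JAR encodes the Spark and Scala versions it was built for, so a mismatch with the Glue session can be caught up front. The sketch below is a minimal example of that check. The function names are hypothetical, and the Glue-to-Spark mapping is an assumption that should be confirmed against the AWS Glue release notes.

```python
import re

# Assumption: Spark minor version shipped with each Glue version.
GLUE_SPARK_MINOR = {"4.0": "3.3", "5.0": "3.5"}

# Matches names such as iceberg-spark-runtime-3.5_2.12-1.7.1.jar
JAR_PATTERN = re.compile(
    r"^iceberg-spark-runtime-"
    r"(?P<spark>\d+\.\d+)_(?P<scala>\d+\.\d+)-(?P<iceberg>\d+\.\d+\.\d+)\.jar$"
)


def parse_runtime_jar(jar_name):
    """Split a runtime JAR file name into Spark, Scala and Iceberg versions."""
    match = JAR_PATTERN.match(jar_name)
    if match is None:
        raise ValueError(f"not an iceberg-spark-runtime JAR name: {jar_name}")
    return match.groupdict()


def jar_matches_glue(jar_name, glue_version):
    """Return True when the JAR's Spark version equals the Glue session's."""
    return parse_runtime_jar(jar_name)["spark"] == GLUE_SPARK_MINOR[glue_version]


print(jar_matches_glue("iceberg-spark-runtime-3.3_2.12-1.7.1.jar", "5.0"))  # False
print(jar_matches_glue("iceberg-spark-runtime-3.5_2.12-1.7.1.jar", "5.0"))  # True
```

The check only compares version strings. It does not confirm that the JAR is actually on the session's classpath.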
Sources
AWS Glue Interactive Sessions Query Iceberg and Non Iceberg Tables | AWS re:Post
Glue script job error spark_catalog requires a single-part namespace, but got [glue_catalog, foo] | AWS re:Post

answered a year ago
