Cannot read cross-account Iceberg table with Glue or EMR


Hi!

I am trying to use DataZone to share an Iceberg table from the Glue Data Catalog with another AWS account. I have created the table with Athena in the source account like this:

CREATE TABLE iceberg_table (
  id int,
  data string,
  category string) 
PARTITIONED BY (category, bucket(16,id)) 
LOCATION 's3://************/dzd_ceozi0qzepfll7/datazone/409ty6lk11tpqj/' 
TBLPROPERTIES (
  'table_type'='ICEBERG',
  'format'='parquet',
  'write_compression'='snappy',
  'optimize_rewrite_delete_file_threshold'='10'
)

INSERT INTO "iceberg_table" ("id", "data", "category")
VALUES (1, 'my data', '100'),
(2, 'hello', '200'),
(3, 'this', '100'),
(4, 'is', '200'),
(5, 'a test', '300'); 

Then I registered it as a data asset in DataZone and shared it with a different project that has an environment in another AWS account. So far, so good: I was able to query the shared table in the other account with Athena. I then thought it wouldn't be a big deal to read the table from EMR (or from a Glue notebook) as well. I started a Glue notebook with an IAM role that has the necessary Glue, S3, and Lake Formation permissions. This is the code of the notebook:

%idle_timeout 60
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
%%configure
{
"--conf":"spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"--datalake-formats":"iceberg"
}

from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SparkSession

catalog_nm = "glue_catalog"
s3_bucket = "s3://************/dzd_ceozi0qzepfll7/datazone/4qaoomp2oefmkr/"

# Register the Glue Data Catalog as an Iceberg catalog named "glue_catalog"
spark = SparkSession.builder \
    .config("spark.sql.defaultCatalog", catalog_nm) \
    .config(f"spark.sql.catalog.{catalog_nm}",
        "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_nm}.warehouse", s3_bucket) \
    .config(f"spark.sql.catalog.{catalog_nm}.catalog-impl",
        "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_nm}.io-impl",
        "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()

sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)

%%sql
show databases

%%sql
show tables in consumerdatalake_sub_db

%%sql
select * from glue_catalog.consumerdatalake_sub_db.iceberg_table limit 10

Both the show databases and the show tables statements work fine, but the last select statement results in the following error:

Py4JJavaError: An error occurred while calling o77.sql. : org.apache.iceberg.exceptions.ValidationException: Input Glue table is not an iceberg table: glue_catalog.consumerdatalake_sub_db.iceberg_table (type=null) at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49) at org.apache.iceberg.aws.glue.GlueToIcebergConverter.validateTable(GlueToIcebergConverter.java:48) at org.apache.iceberg.aws.glue.GlueTableOperations.doRefresh(GlueTableOperations.java:116) at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:95) at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:78) at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:43) .....

It seems the Iceberg GlueCatalog cannot determine that the table type is actually "iceberg". As a test, I created the same Iceberg table directly in the default Glue catalog of the subscriber account, and I was able to query that table without any issues. I compared the two Glue tables and both say "Table Type: ICEBERG". Are there any restrictions when reading a shared Iceberg table? Any idea what could be missing?
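Judging by the stack trace, Iceberg's GlueCatalog validates that the Glue GetTable response carries a table_type=ICEBERG parameter; for a resource link, the response describes the link itself rather than the target table, which would explain the "type=null" in the error. One way to check this from the CLI (a sketch, using the database and table names from my setup):

```shell
# Inspect the consumer-side table. If it is a resource link, the response
# contains a TargetTable section, and Parameters may lack the
# table_type=ICEBERG entry that Iceberg's GlueCatalog validates.
aws glue get-table \
  --database-name consumerdatalake_sub_db \
  --name iceberg_table \
  --query '{Params: Table.Parameters, Target: Table.TargetTable}'

# Compare with the producer-side table, which should show
# "table_type": "ICEBERG" under Parameters.
aws glue get-table \
  --database-name producerdatalake_pub_db \
  --name iceberg_table \
  --query 'Table.Parameters'
```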

When I do the same with a CSV table instead of an Iceberg one, I get a similar error:

df = glueContext.create_data_frame.from_catalog(database='consumerdatalake_sub_db', table_name='customers')
df.show()

Py4JJavaError: An error occurred while calling o82.getCatalogSource. : java.lang.Error: No classification or connection in consumerdatalake_sub_db.customers

In the Glue tables overview I can see that there are no "classifications" set for the shared tables.

  • It seems that this has to do with the Lake Formation resource links that are created for the shared tables. I was able to query both the CSV and the Iceberg data by granting Lake Formation permissions on the tables in the producer data catalog: producerdatalake_pub_db.customers and producerdatalake_pub_db.iceberg_table.

    Here I read that EMR and Glue can access shared tables directly: https://aws.github.io/aws-lakeformation-best-practices/data-sharing/general-data-sharing/#resource-links

    But the AWS docs say that it should also be possible via resource links: https://docs.aws.amazon.com/lake-formation/latest/dg/resource-links-about.html

    Creating a resource link to a database or table enables you to do the following: Access the Data Catalog databases and tables from any AWS Region by creating resource links in those regions pointing to the database and tables in another region. You can run queries in any region with these resource links using Athena, Amazon EMR and run AWS Glue ETL Spark jobs, without copying source data nor the metadata in Glue Data Catalog.
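For reference, the grant that made the queries work can be sketched with the Lake Formation CLI. This is only a sketch: the role ARN and the producer account ID (used as CatalogId) are placeholders for my setup:

```shell
# Sketch: the principal ARN and the producer account ID are placeholders.
# Grant SELECT/DESCRIBE on the target table in the producer catalog;
# the resource link itself only needs DESCRIBE.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/GlueNotebookRole \
  --resource '{"Table": {"CatalogId": "999999999999", "DatabaseName": "producerdatalake_pub_db", "Name": "iceberg_table"}}' \
  --permissions SELECT DESCRIBE
```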

asked 21 days ago
No Answers
