Cross-Account Access for AWS Glue Job (Account A) to Access Glue Data Catalog and Iceberg Tables (Account B)

Hi All,

I'm trying to enable cross-account access where a Glue job running in Account A needs to read/write Iceberg tables that reside in the Glue Data Catalog of Account B.

What I’ve Tried:

Approach 1: Direct IAM Role Access (Account A role) + Resource Policy in Account B

  1. The Glue job in Account A runs with an IAM role (let’s call it GlueRoleA)
  2. In Account B, the resource policy on the Glue Data Catalog grants access to GlueRoleA
  3. The S3 bucket policy allows GlueRoleA to access the data bucket
  4. Lake Formation data permissions (including ALL and SUPER) are granted to GlueRoleA

Approach 2: Assume Role

  1. In Account A, Glue job role (GlueRoleA) assumes a role in Account B (GlueRoleB)
  2. In Account B, GlueRoleB has all required Glue and Lake Formation permissions (including ALL and SUPER)
  3. S3 bucket policy allows access from GlueRoleB
  4. Lake Formation grants are provided to GlueRoleB

In both approaches, when using GlueContext to access Iceberg tables, I encounter the following error: TABLE_OR_VIEW_NOT_FOUND

Please find my code below:

import sys
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
# Initialize Spark Context with the required Spark configurations for Iceberg
spark = SparkSession.builder \
    .appName("CrossAccountGlueIcebergJob") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<bucket_name>/iceberg_warehouse/") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.glue_catalog.catalog-id", "111122223333") \
    .config("spark.sql.catalog.glue_catalog.glue.account-id", "111122223333") \
    .config("spark.sql.defaultCatalog", "glue_catalog") \
    .getOrCreate()

# Initialize GlueContext
glueContext = GlueContext(spark.sparkContext)
# Debugging: Print Spark configurations related to Iceberg catalog setup
print("Spark Configurations for Iceberg Catalog:")
for key, value in spark.sparkContext.getConf().getAll():
    if "spark.sql.catalog" in key:
        print(f"{key} = {value}")
# Create a Boto3 Glue Client for cross-account operations
glue_client = boto3.client('glue', region_name='us-east-1')
account_b_id = "111122223333"  # Replace with Account B ID
# List tables in the specific database in Account B
print("Fetching tables from Account B Glue Data Catalog...")
response = glue_client.get_tables(CatalogId=account_b_id, DatabaseName='welcome_db')
print("Tables in database 'welcome_db':")
for tbl in response['TableList']:
    print("Table:", tbl['Name'])

print("Attempting to read table")
 
df_csv = glueContext.create_data_frame.from_catalog(
    database="welcome_db",
    table_name="dummy",
    catalog_id = account_b_id
)
 
df_csv.show(10)
 
 
for key, value in spark.sparkContext.getConf().getAll():
    if "spark.sql.catalog" in key:
        print(f"{key} = {value}")
 
try:
    df_iceberg = glueContext.create_data_frame.from_catalog(
        database="welcome_db",
        table_name="sample_iceberg_tbl",
        catalog_id = account_b_id
    )
    df_iceberg.show(10)
except Exception as e:
    raise e
finally:
    for key, value in spark.sparkContext.getConf().getAll():
        if "spark.sql.catalog" in key:
            print(f"{key} = {value}")
 

print("Job Completed Successfully.")

Error: Error Category: QUERY_ERROR; Failed Line Number: 66; Spark Error Class: TABLE_OR_VIEW_NOT_FOUND; AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view glue_catalog.welcome_db.sample_iceberg_tbl cannot be found. Verify the spelling and correctness of the schema and catalog.

Note: Reading the CSV-backed table (df_csv) works, but reading the Iceberg table (df_iceberg) fails with TABLE_OR_VIEW_NOT_FOUND.

I'd appreciate any help. Thanks in advance.

1 Answer

I can see a few issues with your current approaches to enabling cross-account access for a Glue job that reads and writes Iceberg tables in another account's Glue Data Catalog.

The key difference between your working CSV table access and failing Iceberg table access is related to how Iceberg tables are handled in cross-account scenarios. Let me explain the issues and provide a solution:

Main Issue

When using GlueContext.create_data_frame.from_catalog() with a catalog_id parameter, it works for standard Glue tables (like your CSV table), but for Iceberg tables, you need to use Spark SQL with the properly configured catalog.
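Before changing the read path, it may help to confirm which tables in welcome_db are actually registered as Iceberg tables. Here is a minimal sketch, assuming the Iceberg Glue integration has marked the table with a table_type parameter set to ICEBERG (which is what Iceberg's GlueCatalog writes when it creates a table). The helper names is_iceberg_table and split_by_format are hypothetical and only for illustration:

```python
from typing import Any, Dict, List


def is_iceberg_table(table: Dict[str, Any]) -> bool:
    """Return True if a Glue table definition is marked as an Iceberg table."""
    params = table.get("Parameters") or {}
    # Compare case-insensitively, since the marker may be stored in either case.
    return str(params.get("table_type", "")).upper() == "ICEBERG"


def split_by_format(tables: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group Glue table names into Iceberg and non-Iceberg lists."""
    groups: Dict[str, List[str]] = {"iceberg": [], "other": []}
    for table in tables:
        key = "iceberg" if is_iceberg_table(table) else "other"
        groups[key].append(table["Name"])
    return groups
```

In the job from the question, this could be fed with the response already being fetched, for example split_by_format(response['TableList']). If sample_iceberg_tbl does not land in the "iceberg" group, the problem is more likely in how the table was registered than in the cross-account permissions.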

Solution

  1. Resource Policy Configuration:
  • In Account B, ensure your Glue Data Catalog resource policy includes the necessary permissions for Account A's role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::ACCOUNT-A-ID:role/GlueRoleA"]
      },
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:ACCOUNT-B-ID:catalog",
        "arn:aws:glue:us-east-1:ACCOUNT-B-ID:database/welcome_db",
        "arn:aws:glue:us-east-1:ACCOUNT-B-ID:table/welcome_db/*"
      ]
    }
  ]
}
  2. Lake Formation Configuration:
  • If using Lake Formation, add a statement to the Data Catalog resource policy to permit cross-account grants:
{
  "Effect": "Allow",
  "Principal": {
    "Service": "ram.amazonaws.com"
  },
  "Action": "glue:ShareResource",
  "Resource": [
    "arn:aws:glue:us-east-1:ACCOUNT-B-ID:table/*/*",
    "arn:aws:glue:us-east-1:ACCOUNT-B-ID:database/*",
    "arn:aws:glue:us-east-1:ACCOUNT-B-ID:catalog"
  ]
}
  3. Code Modification: For Iceberg tables, instead of using create_data_frame.from_catalog(), use Spark SQL directly:
# For Iceberg tables, use Spark SQL with the properly configured catalog
spark.sql("USE glue_catalog.welcome_db")
df_iceberg = spark.sql("SELECT * FROM sample_iceberg_tbl")
df_iceberg.show(10)

Or alternatively:

df_iceberg = spark.sql("SELECT * FROM glue_catalog.welcome_db.sample_iceberg_tbl")
df_iceberg.show(10)
  4. Spark Configuration Adjustments: Make sure your Spark configuration includes these settings. Note that Iceberg's GlueCatalog takes the ID of the target Data Catalog (the Account B account ID) from the glue.id catalog property, so set it alongside the others:
.config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
.config("spark.sql.catalog.glue_catalog.warehouse", "s3://<bucket_name>/iceberg_warehouse/")
.config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
.config("spark.sql.catalog.glue_catalog.catalog-id", "111122223333")
.config("spark.sql.catalog.glue_catalog.glue.id", "111122223333")
.config("spark.sql.defaultCatalog", "glue_catalog")
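To keep the catalog settings in one place, the configuration can be generated from a small helper. The sketch below is not a definitive setup. It assumes the Iceberg version bundled with your Glue runtime honors the glue.id catalog property (the Glue catalog ID, i.e. the Account B account ID) and, for the assume-role approach, the client.factory and client.assume-role.* properties described in the Iceberg AWS documentation. The function names iceberg_glue_catalog_conf and apply_conf are hypothetical:

```python
from typing import Dict, Optional


def iceberg_glue_catalog_conf(
    catalog_name: str,
    warehouse: str,
    glue_catalog_id: str,
    assume_role_arn: Optional[str] = None,
    region: str = "us-east-1",
) -> Dict[str, str]:
    """Build Spark settings for an Iceberg catalog backed by another account's Glue Data Catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    conf = {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
        # Points GlueCatalog at the Data Catalog owned by Account B.
        f"{prefix}.glue.id": glue_catalog_id,
    }
    if assume_role_arn:
        # Assumption: only needed for the assume-role approach (GlueRoleA -> GlueRoleB).
        conf[f"{prefix}.client.factory"] = "org.apache.iceberg.aws.AssumeRoleAwsClientFactory"
        conf[f"{prefix}.client.assume-role.arn"] = assume_role_arn
        conf[f"{prefix}.client.assume-role.region"] = region
    return conf


def apply_conf(builder, conf: Dict[str, str]):
    """Apply each setting to a SparkSession builder (any object with .config(key, value))."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

In the job this would be used as apply_conf(SparkSession.builder.appName("CrossAccountGlueIcebergJob"), conf).getOrCreate(). One reason to consider the assume-role properties: credentials obtained through a separate boto3 sts assume_role call only affect that boto3 client, not the clients Iceberg creates inside Spark.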

If you're using Lake Formation for permissions management, ensure you've granted the appropriate Lake Formation permissions to the role in Account A, and consider using hybrid access mode if you're transitioning from AWS Glue permissions to Lake Formation permissions.
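As a rough illustration of the Lake Formation grants mentioned above, the sketch below builds the keyword arguments for the boto3 Lake Formation grant_permissions call. It assumes the grants are issued from Account B by a Lake Formation administrator. The Account A ID 444455556666 and the helper names are hypothetical placeholders, and the permission lists are a starting point, not a verified minimal set:

```python
from typing import Any, Dict, Sequence


def lf_database_grant(principal_arn: str, catalog_id: str, database: str) -> Dict[str, Any]:
    """Arguments granting DESCRIBE on a database in Account B's catalog."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"CatalogId": catalog_id, "Name": database}},
        "Permissions": ["DESCRIBE"],
    }


def lf_table_grant(
    principal_arn: str,
    catalog_id: str,
    database: str,
    table: str,
    permissions: Sequence[str] = ("SELECT", "DESCRIBE"),
) -> Dict[str, Any]:
    """Arguments granting table-level permissions on a single table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "Table": {"CatalogId": catalog_id, "DatabaseName": database, "Name": table}
        },
        "Permissions": list(permissions),
    }


# Hypothetical Account A role; replace with the real GlueRoleA ARN.
GLUE_ROLE_A = "arn:aws:iam::444455556666:role/GlueRoleA"

grants = [
    lf_database_grant(GLUE_ROLE_A, "111122223333", "welcome_db"),
    lf_table_grant(GLUE_ROLE_A, "111122223333", "welcome_db", "sample_iceberg_tbl"),
]

# With Account B credentials, each grant could then be applied like this:
#   import boto3
#   lf = boto3.client("lakeformation", region_name="us-east-1")
#   for kwargs in grants:
#       lf.grant_permissions(**kwargs)
```

For write access, the permissions tuple would need to be extended (for example with INSERT, DELETE, and ALTER), and the role also needs access to the underlying S3 location.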

Remember that both the resource policy in Account B and the IAM permissions in Account A need to be properly configured for cross-account access to work correctly.
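To keep the two sides in sync, both policies can be generated from the same inputs. This is a sketch under stated assumptions: the read actions are the ones from the resource policy above, while the write actions (glue:CreateTable and glue:UpdateTable) are my assumption about what Iceberg commits through the Glue catalog would additionally need. The helper names and the Account A ID 444455556666 are hypothetical:

```python
import json
from typing import Any, Dict, List

READ_ACTIONS = [
    "glue:GetDatabase",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:BatchGetPartition",
]
# Assumption: extra actions for writing Iceberg tables; adjust to your needs.
WRITE_ACTIONS = ["glue:CreateTable", "glue:UpdateTable"]


def glue_resources(region: str, account_b_id: str, database: str) -> List[str]:
    """ARNs for the catalog, the database, and all tables in it (Account B)."""
    base = f"arn:aws:glue:{region}:{account_b_id}"
    return [f"{base}:catalog", f"{base}:database/{database}", f"{base}:table/{database}/*"]


def _statement(region: str, account_b_id: str, database: str, writable: bool) -> Dict[str, Any]:
    actions = READ_ACTIONS + (WRITE_ACTIONS if writable else [])
    return {
        "Effect": "Allow",
        "Action": actions,
        "Resource": glue_resources(region, account_b_id, database),
    }


def catalog_resource_policy(role_arn: str, region: str, account_b_id: str,
                            database: str, writable: bool = False) -> Dict[str, Any]:
    """Resource policy for Account B's Data Catalog, naming the Account A role."""
    statement = _statement(region, account_b_id, database, writable)
    statement["Principal"] = {"AWS": [role_arn]}
    return {"Version": "2012-10-17", "Statement": [statement]}


def identity_policy(region: str, account_b_id: str, database: str,
                    writable: bool = False) -> Dict[str, Any]:
    """Matching IAM policy to attach to the role in Account A."""
    return {"Version": "2012-10-17", "Statement": [_statement(region, account_b_id, database, writable)]}


policy_b = catalog_resource_policy(
    "arn:aws:iam::444455556666:role/GlueRoleA", "us-east-1", "111122223333", "welcome_db", writable=True
)
policy_json = json.dumps(policy_b, indent=2)

# With Account B credentials, the policy could be applied with:
#   boto3.client("glue", region_name="us-east-1").put_resource_policy(PolicyInJson=policy_json)
# Note that put_resource_policy replaces the existing catalog policy, so merge
# any statements that are already there (such as the ram.amazonaws.com one).
```

Generating both documents from one function makes it harder for the resource list in Account B and the IAM policy in Account A to drift apart.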
Sources
Managing cross-account permissions using both AWS Glue and Lake Formation - AWS Lake Formation
Configure cross-account AWS Glue access in Athena for Spark - Amazon Athena
Granting cross-account access - AWS Glue
Set up cross-account AWS Glue Data Catalog access using AWS Lake Formation and AWS IAM Identity Center with Amazon Redshift and Amazon QuickSight | AWS Big Data Blog

answered 10 months ago
AWS
EXPERT
reviewed 10 months ago
AWS
SUPPORT ENGINEER
reviewed 10 months ago
  • I have tried this method and it's not working. I am still unable to read the Iceberg table in a different AWS account from the Glue script.
