How can I set up a cross-account AWS Glue ETL job for tables using catalog resource policies?

I want to access tables in account A from account B using ETL jobs in the AWS Glue Data Catalog, without using AWS Lake Formation.

Short description

You can set up cross-account access to the AWS Glue Data Catalog with or without AWS Lake Formation. The following sections describe how to set up cross-account catalog access using only AWS Glue.

If you're using Lake Formation, see Cross-account data sharing in Lake Formation for more information on setting up a cross-account resource share.

Note: These steps describe cross-account access within a single AWS Region. They don’t address access to resources located in a different AWS Region.

Resolution

Set up access policies in source and target accounts

Use the following steps to grant resource-level permissions to account B from account A's AWS Glue Data Catalog.

Note: Account A has the AWS Glue Data Catalog resources and account B is the extract, transform, and load (ETL) account. You modify the resource-based policy in account A and the AWS Identity and Access Management (IAM) policies in account B.

Attach a catalog-resource policy in account A

1.    Log in to the AWS Management Console.

2.    In the search bar, search for AWS Glue. Choose Get Started with AWS Glue.

3.    From the left panel, choose Catalog settings.

4.    Under Permissions, enter the following resource policy. This resource policy lets account B access the databases and tables in account A.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1111222233334444:root"
      },
      "Action": "glue:*",
      "Resource": [
        "arn:aws:glue:us-east-1:5555666677778888:catalog",
        "arn:aws:glue:us-east-1:5555666677778888:database/doc_example_DB",
        "arn:aws:glue:us-east-1:5555666677778888:table/doc_example_DB/*"
      ]
    }
  ]
}

Note: Replace the following values in the policy:

  • 1111222233334444 with the account ID for account B
  • 5555666677778888 with the account ID for account A
  • us-east-1 with the Region of your choice
  • doc_example_DB with the name of your database

5.    (Optional) To limit access to a specific role in account B, use the Amazon Resource Name (ARN) of that role as the principal in the policy. For example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1111222233334444:role/service-role/AWSGlueServiceRole_Glue_Test"
      },
      "Action": "glue:*",
      "Resource": [
        "arn:aws:glue:us-east-1:5555666677778888:catalog",
        "arn:aws:glue:us-east-1:5555666677778888:database/doc_example_DB",
        "arn:aws:glue:us-east-1:5555666677778888:table/doc_example_DB/*"
      ]
    }
  ]
}

Note: Replace the following values in the policy:

  • 1111222233334444 with the account ID for account B
  • 5555666677778888 with the account ID for account A
  • us-east-1 with the Region of your choice
  • doc_example_DB with the name of your database
  • AWSGlueServiceRole_Glue_Test with the name of the role that's used to run the ETL job
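
The console steps above could also be scripted. The following is a minimal Python sketch, not an official procedure: it builds the same resource policy for either the whole of account B or a single role, and attaches it with the boto3 Glue client's put_resource_policy call. The function names and the role_path parameter are hypothetical names introduced for illustration, and the sketch assumes that boto3 is installed and that the attach step runs with account A credentials.

```python
import json


def build_catalog_policy(catalog_account_id, etl_account_id, region, database, role_path=None):
    """Build the Data Catalog resource policy for account A as a dict.

    role_path is a hypothetical parameter, for example
    "service-role/AWSGlueServiceRole_Glue_Test". When it is omitted,
    the principal is the whole ETL account (account B root).
    """
    if role_path:
        principal = f"arn:aws:iam::{etl_account_id}:role/{role_path}"
    else:
        principal = f"arn:aws:iam::{etl_account_id}:root"

    prefix = f"arn:aws:glue:{region}:{catalog_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal},
                "Action": "glue:*",
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/*",
                ],
            }
        ],
    }


def attach_catalog_policy(policy, region):
    """Set the policy on the Data Catalog. Assumes account A credentials."""
    import boto3  # imported here so the builder works without the AWS SDK

    glue = boto3.client("glue", region_name=region)
    glue.put_resource_policy(PolicyInJson=json.dumps(policy))
```

For example, build_catalog_policy("5555666677778888", "1111222233334444", "us-east-1", "doc_example_DB") produces the account-level policy from step 4. Because put_resource_policy sets the catalog policy as a whole, any statements that already exist in the catalog policy would need to be merged into the document before the call.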

Attach an IAM policy in account B

The IAM role in account B that runs the ETL job needs access to the databases and tables in account A.

Note: If you're using Amazon Athena with the Data Catalog, then include the default database in the policy. This inclusion makes sure that the GetDatabase and CreateDatabase actions succeed. For more information, see Default database and catalog per AWS Region.

1.    Log in to the IAM console with your AWS login credentials.

2.    From the left panel, choose Roles.

3.    Choose the role name that you're using inside the ETL script.

4.    Attach an IAM policy to the AWS Glue ETL job's IAM role in account B. This gives you access to the database and tables in account A:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetConnection",
        "glue:GetTable",
        "glue:GetPartition"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:5555666677778888:catalog",
        "arn:aws:glue:us-east-1:5555666677778888:database/default",
        "arn:aws:glue:us-east-1:5555666677778888:database/doc_example_DB",
        "arn:aws:glue:us-east-1:5555666677778888:table/doc_example_DB/*"
      ]
    }
  ]
}

Note: Replace the following values in the policy:

  • 5555666677778888 with the account ID for account A
  • us-east-1 with the Region of your choice
  • doc_example_DB with the name of your database

5.    Verify that the policy you created is attached to the IAM role in account B.

6.    Test if account B has access to the Data Catalog in account A. Create an ETL job with the following scripts:

Dynamic frame script:

df = glueContext.create_dynamic_frame.from_catalog(database="doc_example_DB", table_name="doc_example_table", catalog_id="5555666677778888", region="us-east-1")

Data frame script:

"""Create Spark Session with cross-account AWS Glue Data Catalog"""
from pyspark.sql import SparkSession

spark_session = (
    SparkSession.builder.appName("Spark Glue Example")
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    # Account ID of account A, which owns the Data Catalog
    .config("hive.metastore.glue.catalogid", "5555666677778888")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a few rows from the table in account A to confirm access
table_df = spark_session.sql("SELECT * FROM doc_example_DB.doc_example_table LIMIT 10")

table_df.show()

Note: Replace the following values in the scripts:

  • 5555666677778888 with the account ID for account A
  • doc_example_DB with the name of your database
  • doc_example_table with the name of your table
  • us-east-1 with the Region of your choice
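
Steps 4 and 5, plus a quick access check before you build the ETL job in step 6, could also be scripted with boto3. The following is a sketch under stated assumptions rather than a definitive implementation: it adds the policy as an inline role policy through put_role_policy (the console steps might instead attach a managed policy), the function names and the policy name CrossAccountGlueCatalogRead are hypothetical, and the get_table check reflects the ETL role's permissions only if the script itself runs under that role.

```python
import json


def build_etl_role_policy(catalog_account_id, region, database):
    """Build the IAM policy that lets the ETL role read account A's catalog."""
    prefix = f"arn:aws:glue:{region}:{catalog_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetConnection",
                    "glue:GetTable",
                    "glue:GetPartition",
                ],
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/default",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/*",
                ],
            }
        ],
    }


def attach_role_policy(role_name, policy, policy_name="CrossAccountGlueCatalogRead"):
    """Add the policy inline to the ETL role, then read it back (steps 4 and 5).

    Assumes account B credentials that are allowed to edit the role.
    The policy name is a hypothetical placeholder.
    """
    import boto3  # imported here so the builder works without the AWS SDK

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )
    # get_role_policy raises an error if the inline policy is missing
    return iam.get_role_policy(RoleName=role_name, PolicyName=policy_name)["PolicyName"]


def check_catalog_access(catalog_account_id, region, database, table):
    """Fetch table metadata from account A's catalog as a quick access check.

    Only meaningful when the caller's credentials are those of the ETL role.
    """
    import boto3

    glue = boto3.client("glue", region_name=region)
    response = glue.get_table(
        CatalogId=catalog_account_id, DatabaseName=database, Name=table
    )
    return response["Table"]["Name"]
```

A call such as check_catalog_access("5555666677778888", "us-east-1", "doc_example_DB", "doc_example_table") should return the table name when both the catalog resource policy in account A and the role policy in account B are in place, and raise an access error otherwise.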

Related information

Granting cross-account access

Specifying AWS Glue resource ARNs

About upgrading to the Lake Formation permissions model

Migration between the Hive metastore and the AWS Glue Data Catalog

AWS Glue resource policies for access control

AWS OFFICIAL
Updated a year ago