How can I use multiple catalogs in my EMR EC2 cluster?


In my account, I have two Glue catalogs: the default catalog (AWSDataCatalog) and a catalog shared from a different account. How can I access the databases in both catalogs from an EMR EC2 cluster?

I have this configuration in my EMR Cluster: { "Classification": "hive-site", "Properties": { "aws.glue.catalog.separator": "/", "hive.metastore.glue.catalogid": "<my-accountid>", "hive.metastore.glue.datacatalog.enabled": "true" } }

If I have to access the catalog from a different account, do I need to have another entry for the other account ID in the configuration? Something like this:

{ "Classification": "hive-site", "Properties": { "aws.glue.catalog.separator": "/", "hive.metastore.glue.catalogid": "<my-accountid>", "hive.metastore.glue.catalogid": "<other-accountid>", "hive.metastore.glue.datacatalog.enabled": "true" } }

asked 10 days ago, 60 views
2 Answers

To use multiple catalogs in your EMR EC2 Cluster, including the default catalog in your account and a shared catalog from a different account, you don't need to specify multiple catalog IDs in a single configuration. Instead, you can use the following approach:

  1. Keep your existing configuration for the default catalog:

{ "Classification": "hive-site", "Properties": { "aws.glue.catalog.separator": "/", "hive.metastore.glue.catalogid": "<my-accountid>", "hive.metastore.glue.datacatalog.enabled": "true" } }

  2. To access the shared catalog from a different account, you don't need to add another entry in the configuration. Instead, specify the catalog ID (the AWS account ID that owns the catalog) when you query tables in that catalog.

For example, to query a table named 'mytable' in a database 'mydb' that exists in the shared catalog with account ID '111122223333', use a query like this:

SELECT * FROM `111122223333/mydb`.mytable

The backticks and forward slash are important here. The format is `catalogID/database`.table, with the backticks enclosing the catalog ID and the database name together.

  3. Ensure that your EMR cluster has the necessary permissions to access the shared catalog. This typically involves setting up cross-account access permissions between your account and the account that owns the shared catalog.

  4. If you're using Spark, you might need to add a similar configuration for the "spark-hive-site" classification to ensure Spark can also access the AWS Glue Data Catalog.

By using this approach, you can access databases and tables from both your default catalog and the shared catalog without needing to specify multiple catalog IDs in your configuration.
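As a side note on why repeating the "hive.metastore.glue.catalogid" key (as proposed in the question) cannot carry two catalog IDs: a JSON object holds one value per key. The minimal sketch below shows what Python's json module does with the proposed configuration. How EMR itself parses duplicate keys is an assumption here, not something confirmed by the sources.

```python
import json

# The configuration proposed in the question, with the same key listed twice.
proposed = """
{
  "Classification": "hive-site",
  "Properties": {
    "aws.glue.catalog.separator": "/",
    "hive.metastore.glue.catalogid": "<my-accountid>",
    "hive.metastore.glue.catalogid": "<other-accountid>",
    "hive.metastore.glue.datacatalog.enabled": "true"
  }
}
"""

properties = json.loads(proposed)["Properties"]

# Python's json module keeps the last value for a repeated key,
# so only one catalog ID survives parsing.
print(properties["hive.metastore.glue.catalogid"])
print(len(properties))
```

Only one of the two account IDs remains after parsing, which is why the second catalog is addressed in the query itself rather than in the configuration.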
Sources
Query an AWS Glue Data Catalog that's in another account with Amazon EMR | AWS re:Post
Using Presto with the AWS Glue Data Catalog - Amazon EMR

answered 10 days ago
  • Spark is not recognizing the catalog separator "/" and it is throwing a syntax error.


To use both your account's default catalog and a catalog shared from an external account in an EMR EC2 cluster, follow the steps below.

Attach the following AWS Glue resource policy to the catalog in the external account:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS":"arn:aws:iam::YOUR-ACCOUNT-ID:root"
         },
         "Action":"glue:*",
         "Resource":[
            "arn:aws:glue:us-east-1:CATALOG-ACCOUNT-ID:table/*/*",
            "arn:aws:glue:us-east-1:CATALOG-ACCOUNT-ID:database/glue",
            "arn:aws:glue:us-east-1:CATALOG-ACCOUNT-ID:catalog"
         ]
      }
   ]
}

This policy grants full Glue permissions to the specified account (YOUR-ACCOUNT-ID) to access:

  1. All tables in all databases
  2. The specific database named "glue"
  3. The Glue Data Catalog

Note: Replace "YOUR-ACCOUNT-ID" with the account ID that requires access, and "CATALOG-ACCOUNT-ID" with the account ID where the external Glue Catalog resides.
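If you prefer to generate the policy rather than edit placeholders by hand, a minimal Python sketch could look like the following. The function name glue_resource_policy and its parameters are hypothetical, the account IDs are placeholders, and the statement is wrapped in the standard "Version"/"Statement" envelope on the assumption that a complete policy document is required.

```python
import json


def glue_resource_policy(consumer_account_id, catalog_account_id,
                         region="us-east-1", database="glue"):
    """Return the cross-account Glue resource policy as a JSON string.

    consumer_account_id: account that needs access (YOUR-ACCOUNT-ID above).
    catalog_account_id: account that owns the catalog (CATALOG-ACCOUNT-ID above).
    """
    prefix = f"arn:aws:glue:{region}:{catalog_account_id}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{consumer_account_id}:root"},
                "Action": "glue:*",
                "Resource": [
                    f"{prefix}:table/*/*",
                    f"{prefix}:database/{database}",
                    f"{prefix}:catalog",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Placeholder account IDs, for illustration only.
print(glue_resource_policy("111122223333", "444455556666"))
```

The resulting string can be pasted into the Glue console's catalog settings in the catalog-owning account, or passed as the PolicyInJson argument of the boto3 Glue client's put_resource_policy call.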

The S3 bucket owner must implement the following bucket policy to grant necessary access permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::YOUR-ACCOUNT-ID:root"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::bucket-name",
                "arn:aws:s3:::bucket-name/*"
            ]
        }
    ]
}

This policy grants read-only access to the specified AWS account (YOUR-ACCOUNT-ID). It allows two actions:

  • s3:ListBucket: Permission to list objects in the bucket
  • s3:GetObject: Permission to retrieve objects from the bucket

It applies to both:

  • The bucket itself (arn:aws:s3:::bucket-name)
  • All objects within the bucket (arn:aws:s3:::bucket-name/*)

Note: Replace "YOUR-ACCOUNT-ID" with the account ID that requires access and "bucket-name" with the actual name of the S3 bucket.

When launching an Amazon EMR cluster, include the following configuration for AWS Glue Data Catalog integration:

[
   {
      "Classification":"hive-site",
      "Properties":{
         "aws.glue.catalog.separator":"/",
         "hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
         "hive.metastore.glue.catalogid":"YOUR-ACCOUNT-ID"
      }
   },
   {
      "Classification":"spark-hive-site",
      "Properties":{
         "aws.glue.catalog.separator":"/",
         "hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
         "hive.metastore.glue.catalogid":"YOUR-ACCOUNT-ID"
      }
   }
]

This configuration sets up both the Hive ("hive-site") and Spark ("spark-hive-site") classifications, with the same three properties in each:

  • Defines the catalog separator as "/"
  • Sets the metastore client factory to use AWS Glue
  • Specifies the AWS Glue Catalog ID (account ID where the catalog resides)

Note: Replace "YOUR-ACCOUNT-ID" with the AWS account ID where the Glue Data Catalog is located.
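Because the two classifications carry identical properties, the list can also be generated so that they never drift apart. The sketch below is one way to do that in Python; the function name glue_catalog_configurations is hypothetical and the account ID is a placeholder.

```python
import json

# Client factory class name, copied from the configuration above.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(catalog_id, separator="/"):
    """Build the hive-site and spark-hive-site entries with identical properties."""
    properties = {
        "aws.glue.catalog.separator": separator,
        "hive.metastore.client.factory.class": GLUE_FACTORY,
        "hive.metastore.glue.catalogid": catalog_id,
    }
    # Copy the dict so each classification owns its own Properties object.
    return [
        {"Classification": name, "Properties": dict(properties)}
        for name in ("hive-site", "spark-hive-site")
    ]


# Placeholder account ID, for illustration only.
print(json.dumps(glue_catalog_configurations("111122223333"), indent=2))
```

The printed JSON can be saved to a file and supplied at launch, for example through the --configurations option of aws emr create-cluster or the Configurations parameter of boto3's run_job_flow.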

Query tables in own AWS account

To retrieve data from a table within your current AWS account using Spark SQL, execute the following query:

spark.sql("SELECT * FROM DatabaseName.TableName LIMIT 5").show()

Query tables in another AWS account

When accessing tables located in a different AWS account using Spark SQL, you must include the account ID (catalog ID) in the table reference. Use the following format:

`ACCOUNT-ID/DATABASE-NAME`.TABLE-NAME

Example: To retrieve 5 records from a table with the following parameters:

  • Account ID: 111122223333
  • Database: testdb
  • Table: demotable1

Use this Spark SQL command:

spark.sql("SELECT * FROM `111122223333/testdb`.demotable1 LIMIT 5").show()

Note:

  • The account ID and database name must be enclosed in backticks (`)
  • A forward slash (/) separates the account ID and database name
  • The table name follows after a period (.)

This syntax enables cross-account table querying in AWS Glue Data Catalog.
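Since the backticks are easy to lose when a query is assembled from variables, a small helper can build the reference for you. This is a sketch only: the function name cross_account_table is hypothetical, and it assumes the catalog separator is configured as "/" as shown earlier.

```python
def cross_account_table(catalog_id, database, table, separator="/"):
    """Return a `catalog_id/database`.table reference for Spark SQL."""
    for part in (catalog_id, database, table):
        # A backtick inside a name would break the quoting, so reject it.
        if "`" in part:
            raise ValueError(f"backtick not allowed in identifier: {part!r}")
    return f"`{catalog_id}{separator}{database}`.{table}"


table_ref = cross_account_table("111122223333", "testdb", "demotable1")
query = f"SELECT * FROM {table_ref} LIMIT 5"
print(query)

# With an active SparkSession named `spark`, the query would be run as:
#   spark.sql(query).show()
```

The printed query matches the example above, with the account ID and database name enclosed together in one pair of backticks.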

AWS
answered 9 days ago
  • Hi Veera, Thanks for the answer. I am running into this error when I try to read an Iceberg-formatted table in the shared catalog from a different account. This is the error:

    org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table mytable_name. StorageDescriptor#InputFormat cannot be null for table: mytable_name

    I can read an Iceberg table from the same account without any issues. This is my sparksession configuration:

    %%configure -f
    {
      "conf": {
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.spark_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.defaultCatalog": "spark_catalog",
        "spark.dynamicAllocation.enabled": "false",
        "spark.pyspark.python": "python",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
        "spark.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.5"
      }
    }
