To use multiple catalogs in your EMR EC2 Cluster, including the default catalog in your account and a shared catalog from a different account, you don't need to specify multiple catalog IDs in a single configuration. Instead, you can use the following approach:
- Keep your existing configuration for the default catalog:
{ "Classification": "hive-site", "Properties": { "aws.glue.catalog.separator": "/", "hive.metastore.glue.catalogid": "<my-accountid>", "hive.metastore.glue.datacatalog.enabled": "true" } }
- To access the shared catalog from a different account, you don't need to add another entry in the configuration. Instead, you can specify the catalog ID (which is the AWS account ID) when querying the tables in that catalog.
For example, if you want to query a table named 'mytable' in a database 'mydb' that exists in the shared catalog with account ID '111122223333', you would use a query like this:
SELECT * FROM `111122223333/mydb`.mytable
The backticks and forward slash are important here. The format is `catalogID/database`.table.
- Ensure that your EMR cluster has the necessary permissions to access the shared catalog. This typically involves setting up cross-account access permissions between your account and the account that owns the shared catalog.
- If you're using Spark, you might need to add a similar configuration for the "spark-hive-site" classification so that Spark can also access the AWS Glue Data Catalog.
By using this approach, you can access databases and tables from both your default catalog and the shared catalog without needing to specify multiple catalog IDs in your configuration.
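As a quick illustration, here is a minimal PySpark sketch of this approach. It assumes the hive-site (and spark-hive-site) configuration above, plus a hypothetical shared catalog in account 111122223333 containing database "mydb" and table "mytable":

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Default catalog in your own account: no catalog ID prefix needed.
spark.sql("SELECT * FROM mydb.mytable LIMIT 5").show()

# Shared catalog in the other account: prefix the database with the
# catalog (account) ID, with backticks around the combined identifier.
spark.sql("SELECT * FROM `111122223333/mydb`.mytable LIMIT 5").show()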
Sources
Query an AWS Glue Data Catalog that's in another account with Amazon EMR | AWS re:Post
Using Presto with the AWS Glue Data Catalog - Amazon EMR
To use multiple catalogs in an EMR EC2 cluster environment, covering both the default account catalog and a shared catalog from an external account, implement the following approach:
First, attach the following AWS Glue Data Catalog resource policy in the external account (the account that owns the shared catalog):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::YOUR-ACCOUNT-ID:root"
      },
      "Action": "glue:*",
      "Resource": [
        "arn:aws:glue:us-east-1:CATALOG-ACCOUNT-ID:table/*/*",
        "arn:aws:glue:us-east-1:CATALOG-ACCOUNT-ID:database/glue",
        "arn:aws:glue:us-east-1:CATALOG-ACCOUNT-ID:catalog"
      ]
    }
  ]
}
This policy grants full Glue permissions to the specified account (YOUR-ACCOUNT-ID) to access:
- All tables in all databases
- The specific database named "glue"
- The Glue Data Catalog
Note: Replace "YOUR-ACCOUNT-ID" with the account ID that requires access, and "CATALOG-ACCOUNT-ID" with the account ID where the external Glue Catalog resides.
The S3 bucket owner must implement the following bucket policy to grant necessary access permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::YOUR-ACCOUNT-ID:root"
},
"Action": [
"s3:ListBucket",
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::bucket-name",
"arn:aws:s3:::bucket-name/*"
]
}
]
}
This policy:
- Grants read-only access to the specified AWS account (YOUR-ACCOUNT-ID)
- Allows two specific actions:
  - s3:ListBucket: permission to list objects in the bucket
  - s3:GetObject: permission to retrieve objects from the bucket
- Applies to both:
  - The bucket itself (arn:aws:s3:::bucket-name)
  - All objects within the bucket (arn:aws:s3:::bucket-name/*)
Note: Replace "YOUR-ACCOUNT-ID" with the account ID that requires access and "bucket-name" with the actual name of the S3 bucket.
When launching an Amazon EMR cluster, include the following configuration for AWS Glue Data Catalog integration:
[
{
"Classification":"hive-site",
"Properties":{
"aws.glue.catalog.separator": "/",
"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
"hive.metastore.glue.catalogid":"YOUR-ACCOUNT-ID"
}
},
{
"Classification":"spark-hive-site",
"Properties":{
"aws.glue.catalog.separator": "/",
"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
"hive.metastore.glue.catalogid":"YOUR-ACCOUNT-ID"
}
}
]
This configuration:
- Sets up both Hive and Spark-Hive configurations
- Specifies three key properties for each:
  - Defines the catalog separator as "/"
  - Sets the metastore client factory to use AWS Glue
  - Specifies the AWS Glue Catalog ID (account ID where the catalog resides)
Note: Replace "YOUR-ACCOUNT-ID" with the AWS account ID where the Glue Data Catalog is located.
Query tables in your own AWS account
To retrieve data from a table within your current AWS account using Spark SQL, execute the following query:
spark.sql("SELECT * FROM DatabaseName.TableName LIMIT 5").show()
Query tables in another AWS account
When accessing tables located in a different AWS account with Spark SQL, you must include the account ID (catalog ID) in your query. Use the following format:
`ACCOUNT-ID/DATABASE-NAME`.TABLE-NAME
Example: To retrieve 5 records from a table with the following parameters:
- Account ID: 111122223333
- Database: testdb
- Table: demotable1
Use this Spark SQL command:
spark.sql("SELECT * FROM `111122223333/testdb`.demotable1 LIMIT 5").show()
Note:
- The account ID and database name must be enclosed in backticks (`)
- A forward slash (/) separates the account ID and database name
- The table name follows after a period (.)
This syntax enables cross-account table querying in AWS Glue Data Catalog.
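Because both catalogs are visible in the same session, you can also combine them in a single query. Here is a hypothetical sketch; the local database, table, and join column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Join a table in your own catalog with one in the shared catalog.
df = spark.sql("""
    SELECT own_t.id, shared_t.value
    FROM localdb.customers AS own_t
    JOIN `111122223333/testdb`.demotable1 AS shared_t
      ON own_t.id = shared_t.id
""")
df.show()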
Hi Veera, thanks for the answer. I am running into this error when I try to read an Iceberg-formatted table in the shared catalog from a different account. This is the error:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table mytable_name. StorageDescriptor#InputFormat cannot be null for table: mytable_name
I can read an Iceberg table from the same account without any issues. This is my sparksession configuration:
%%configure -f
{
  "conf": {
    "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
    "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.spark_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": "spark_catalog",
    "spark.dynamicAllocation.enabled": "false",
    "spark.pyspark.python": "python",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.5"
  }
}
Spark is not recognizing the catalog separator "/" and throws a syntax error.