Skip to content

How do I use Hive and Spark on Amazon EMR to query a Data Catalog that's in a different account?

4 minute read
0

I want to use Apache Hive and Apache Spark to access and query another AWS account's AWS Glue Data Catalog in Amazon EMR.

Short description

To dynamically access Data Catalog in different accounts, specify the property aws.glue.catalog.separator in your Hive or Spark configurations. You can also specify the property hive.metastore.glue.catalogid in your configurations to access a specific Data Catalog in another account.

Resolution

Grant cross-account access to AWS Glue. Be sure that the Amazon Simple Storage Service (Amazon S3) bucket that the AWS Glue tables point to is configured for cross-account access. For more information, see How do I set up cross-account access for EMR File System?

Dynamically access Data Catalog in different accounts

Apply configuration changes to a new cluster

When you launch an EMR cluster, add a configuration object similar to the following example. For more information, see Configure applications when you create a cluster.

[
    {
        "Classification": "hive-site",
        "Properties": {
            "aws.glue.catalog.separator": "/",
            "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "aws.glue.catalog.separator": "/",
            "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
    }
]

Apply configuration changes to a running cluster

Complete the following steps:

  1. Open the Amazon EMR console.
  2. In the cluster list, under Cluster Id, choose the active cluster that you want to reconfigure.
  3. In the cluster details page, choose the Applications tab.
  4. Under Instance group configurations, select the instance group, and then choose Reconfigure.
  5. Choose Add new configuration, and then enter the following information:
    For Classification, enter hive-site.
    For Property, enter aws.glue.catalog.separator.
    For Value, enter /.
    For Classification, enter hive-site.
    For Property, enter hive.metastore.client.factory.class.
    For Value, enter com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.
    For Classification, enter spark-hive-site.
    For Property, enter aws.glue.catalog.separator.
    For Value, enter /.
    For Classification, enter spark-hive-site.
    For Property, enter hive.metastore.client.factory.class.
    For Value, enter com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.
  6. Select Apply this configuration to all active instance groups, and then choose Save changes. For more information, see Reconfigure an instance group in a running cluster.

Query tables in another account

Note: When you run Spark jobs on Amazon EMR version 6.5, pass the configuration property spark.sql.catalogImplementation=hive. For more information, see Dynamically loading Spark properties on the Apache Spark website.

To query a table that's in a different account, specify the account number in the query. The account number is the same as the catalog ID. For example, to query demodb.tab1 in account 111122223333, run the following command.

Hive:

SELECT * FROM `111122223333/demodb.tab1` LIMIT 5;

Spark:

spark.sql("SELECT * FROM `111122223333/demodb`.tt1 LIMIT 5").show()

Run the preceding Spark command in the spark-submit script, or as a notebook shell command.

You can also join tables across two catalogs.

Hive:

SELECT * FROM `111122223333/demodb.tab1` t1 INNER JOIN `444455556666/demodb.tab2` t2 ON t1.col1 = t2.col2

Spark:

spark.sql(SELECT * FROM `111122223333/demodb.tab1` t1 INNER JOIN `444455556666/demodb.tab2` t2 ON t1.col1 = t2.col2).show()

Run the preceding Spark command in the spark-submit script, or as a notebook shell command.

Access a specific Data Catalog in another account

Apply configuration changes to a new cluster

When you launch an EMR cluster, add a configuration object similar to the following example. For more information, see Configure applications when you create a cluster.

[
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
            "hive.metastore.glue.catalogid": "account-id"
        }
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
            "hive.metastore.glue.catalogid": "account-id"
        }
    }
]

Apply configuration changes to a running cluster

Complete the following steps:

  1. Open the Amazon EMR console.
  2. In the cluster list, under Cluster Id, choose the active cluster that you want to reconfigure.
  3. In the cluster details page, choose the Applications tab.
  4. Under Instance group configurations, select the instance group, and then choose Reconfigure.
  5. Choose Add new configuration, and then enter the following information:
    For Classification, enter hive-site.
    For Property, enter hive.metastore.glue.catalogid.
    For Value, enter account-id.
    For Classification, enter hive-site.
    For Property, enter hive.metastore.client.factory.class.
    For Value, enter com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.
    For Classification, enter spark-hive-site.
    For Property, enter hive.metastore.glue.catalogid.
    For Value, enter account-id.
    For Classification, enter spark-hive-site.
    For Property, enter hive.metastore.client.factory.class.
    For Value, enter com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.
  6. Check Apply this configuration to all active instance groups, and then choose Save changes. For more information, see Reconfigure an instance group in a running cluster.

Query tables in another account

To query tables that are stored in the specified Data Catalog, run a command similar to the following example.

Hive:

SELECT * FROM `demodb.tab1` LIMIT 5;

Spark:

spark.sql("SELECT * FROM `demodb`.tt1 LIMIT 5").show()
AWS OFFICIALUpdated 6 months ago
1 Comment

In my account, I have two Glue Catalogs (one is the default catalog, AWSDataCatalog, and another catalog is shared from a different account). How can I access the databases in both catalogs from EMR EC2 Cluster? I have this configuration in my EMR Cluster: { "Classification": "hive-site", "Properties": { "aws.glue.catalog.separator": "/", "hive.metastore.glue.catalogid": "<my-accountid>", "hive.metastore.glue.datacatalog.enabled": "true" } }

If I have to access the catalog from a different account, do I need to have another entry for the other account ID? Something like this:

{ "Classification": "hive-site", "Properties": { "aws.glue.catalog.separator": "/", "hive.metastore.glue.catalogid": "<my-accountid>", "hive.metastore.glue.catalogid": "<other-accountid>", "hive.metastore.glue.datacatalog.enabled": "true" } }

replied 8 months ago