Integrating Glue Catalog with EMR on EKS Managed Endpoint for EMR Studio

0

In EMR Studio, when attaching an EMR Virtual cluster to the notebook, Glue catalog is not accessible. Some common errors on trying to access Glue are:

  1. "Hive support is required to ..."
  2. "Table or view not found ..."

Adding enableHiveSupport() to the spark statement does not seem to work either. For example:

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .enableHiveSupport() \
    .getOrCreate()

What configuration is required when connecting from the notebook kernel to make Glue catalog accessible to the notebook?

AWS
質問済み 3年前1017ビュー
1回答
0
承認された回答

For EMR Studio to connect to EMR on EKS, a managed endpoint needs to be created. This managed endpoint needs to be configured to use Hive as the catalog and to point to Glue. Use the following CLI to configure the managed endpoint that is able to connect to Glue as the catalog:

aws emr-containers create-managed-endpoint \
--type JUPYTER_ENTERPRISE_GATEWAY \
--virtual-cluster-id ${virtclusterid} \
--name virtual-emr-endpoint \
--execution-role-arn ${role_arn} \
--release-label ${emr_release_label} \
--certificate-arn ${certarn} \
--region ${region} \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
          "spark.sql.catalogImplementation": "hive"
        }
      }
    ]
  }'

See the following documentation for more information about the various flags and description: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-create-eks-cluster.html

AWS
回答済み 3年前

ログインしていません。 ログイン 回答を投稿する。

優れた回答とは、質問に明確に答え、建設的なフィードバックを提供し、質問者の専門分野におけるスキルの向上を促すものです。

質問に答えるためのガイドライン

関連するコンテンツ