Integrating Glue Catalog with EMR on EKS Managed Endpoint for EMR Studio

0

In EMR Studio, when attaching an EMR Virtual cluster to the notebook, Glue catalog is not accessible. Some common errors on trying to access Glue are:

  1. "Hive support is required to ..."
  2. "Table or view not found ..."

Adding enableHiveSupport() to the spark statement does not seem to work either. For example:

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .enableHiveSupport() \
    .getOrCreate()

What configuration is required when connecting from the notebook kernel to make Glue catalog accessible to the notebook?

AWS
질문됨 3년 전1017회 조회
1개 답변
0
수락된 답변

For EMR Studio to connect to EMR on EKS, a managed endpoint needs to be created. This managed endpoint needs to be configured to use Hive as the catalog and to point to Glue. Use the following CLI to configure the managed endpoint that is able to connect to Glue as the catalog:

aws emr-containers create-managed-endpoint \
--type JUPYTER_ENTERPRISE_GATEWAY \
--virtual-cluster-id ${virtclusterid} \
--name virtual-emr-endpoint \
--execution-role-arn ${role_arn} \
--release-label ${emr_release_label} \
--certificate-arn ${certarn} \
--region ${region} \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
          "spark.sql.catalogImplementation": "hive"
        }
      }
    ]
  }'

See the following documentation for more information about the various flags and description: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-create-eks-cluster.html

AWS
답변함 3년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인