Integrate the Glue Catalog with our own Spark application deployed on EKS


We have deployed Apache Spark into a Kubernetes cluster on our own. In the past, on EMR, setting "hive.metastore.client.factory.class" was enough to use the Glue catalog. Unfortunately, in our own deployment Spark does not see the Glue databases, and no exception is logged by Spark.

Our configuration:

spark = SparkSession.builder()
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()

We built the client factory .jar from: https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
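To make the symptom concrete, here is a minimal PySpark sketch of the same configuration; the app name is a placeholder, and the SHOW DATABASES call is simply a quick way to check whether the Glue databases are visible:

from pyspark.sql import SparkSession

# Same factory class as in the configuration above.
spark = (
    SparkSession.builder
    .appName("glue-catalog-check")  # placeholder app name
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

# In our deployment, the Glue databases do not appear in this listing,
# and no exception is logged.
spark.sql("SHOW DATABASES").show()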

Could someone help?

Best regards,

Asked a year ago · 2008 views
1 Answer

Hello,

Assuming that you have built the jars for your specific Spark version as described in the instructions at https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore, I was able to successfully connect to my Glue Catalog tables by following the steps below:

  1. I built a Spark Docker image and pushed it to an ECR repository, following the instructions provided [1].

  2. I built a new Spark Docker image on top of that base image, adding the Glue Hive catalog client jars mentioned on the GitHub page, and pushed this patched image to the ECR repository as well.

  3. I created an EKS cluster, along with a namespace and a service account dedicated to Spark jobs.

  4. I downloaded Spark to my computer and wrote a small PySpark script to read from my Glue table (a sketch of such a script appears after the spark-submit command below).

  5. Finally, I ran the "spark-submit" command below, which completed successfully:

spark-submit \
  --master k8s://https://<Kubernetes url> \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=<IMAGE_NAME> \
  --conf spark.kubernetes.namespace=<NAMESPACE> \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --conf spark.hive.metastore.glue.catalogid=<AWS ACCOUNT ID> \
  --conf spark.hive.imetastoreclient.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --conf spark.kubernetes.file.upload.path=s3a://Bucket/ \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=<SERVICE ACCOUNT NAME> \
  script.py
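For reference, the small PySpark script from step 4 could look roughly like the following. The database and table names (my_glue_database, my_glue_table) are placeholders rather than anything from the original post, and the Glue metastore settings come from the spark-submit --conf flags above, so the script itself stays simple:

# script.py -- minimal sketch of the job submitted above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-glue-table")
    .enableHiveSupport()  # routes metastore calls through Hive, and thus the Glue client factory
    .getOrCreate()
)

# Read a table registered in the Glue Data Catalog; replace the names with your own.
df = spark.sql("SELECT * FROM my_glue_database.my_glue_table LIMIT 10")
df.show()

spark.stop()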

Hope this information helps!

--Reference--
[1] https://spark.apache.org/docs/latest/running-on-kubernetes.html#:~:text=It%20can%20be%20found%20in,use%20with%20the%20Kubernetes%20backend

AWS
Support Engineer
Durga_B
Answered a year ago
