Hello,
Assuming that you have built the jars for your specific Spark version as described in the instructions at https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore, I was able to connect to my Glue catalog tables by following the steps below:
- Built a Spark Docker image and pushed it to an ECR repo, following the instructions provided [1].
- Built a new Spark Docker image on top of that base image by adding the Glue Hive catalog client jars mentioned on the GitHub page, and pushed this patched image to the ECR repo as well.
- Created an EKS cluster, along with a namespace and a service account specifically for Spark jobs.
- Downloaded Spark on my computer and wrote a small PySpark script to read from my Glue table.
- Finally, ran the "spark-submit" command below, which completed successfully:
spark-submit \
  --master k8s://https://<Kubernetes url> \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=<IMAGE_NAME> \
  --conf spark.kubernetes.namespace=<NAMESPACE> \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --conf spark.hive.metastore.glue.catalogid=<AWS ACCOUNT ID> \
  --conf spark.hive.imetastoreclient.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --conf spark.kubernetes.file.upload.path=s3a://Bucket/ \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=<SERVICE ACCOUNT NAME> \
  script.py
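For reference, the small PySpark script mentioned above (`script.py`) might look like the following sketch. The database and table names (`my_db`, `my_table`) are placeholders I have made up, not names from the original setup; replace them with your own Glue catalog objects.

```python
# script.py -- minimal sketch of a PySpark job that reads a Glue catalog table.
# Assumes the Glue Hive catalog client jars are on the classpath and the
# spark-submit confs shown above are set; "my_db" and "my_table" are placeholders.

def fq_table(database: str, table: str) -> str:
    """Return the fully qualified table name used in the Spark SQL query."""
    return f"{database}.{table}"

def main() -> None:
    # Imported inside main so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("glue-catalog-read")
        .enableHiveSupport()  # needed so Spark goes through the Hive metastore client
        .getOrCreate()
    )

    # Read a few rows from the Glue-backed table and print them to the driver log.
    df = spark.sql(f"SELECT * FROM {fq_table('my_db', 'my_table')} LIMIT 10")
    df.show()

    spark.stop()

if __name__ == "__main__":
    main()
```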
Hope this information helps!
Reference:
[1] https://spark.apache.org/docs/latest/running-on-kubernetes.html#:~:text=It%20can%20be%20found%20in,use%20with%20the%20Kubernetes%20backend

To run a Spark job via the Spark Operator deployed on an EKS cluster and read a Glue Data Catalog table, we followed the steps in https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore and built the jars successfully, excluding a few dependencies for branch_3.1 as per the resolution mentioned here. However, we are still unable to query the Glue Data Catalog tables. I suspect the jars are not compatible with the Spark 3.5.4 version we are using. Can someone please help me select the right jars for Spark 3.5.4? Appreciate any help.