Connecting to Glue Hive Data Catalog from EC2 or Local Computer with Spark

Hi, I built an Iceberg table that uses Glue as the Hive catalog. Team members I work with want to connect to it using Spark. They either run Spark locally on their laptops and want to read the table, or they run Spark locally in an Airflow task on an EC2 instance and want to connect to it. Is it possible to configure Spark that is not running on Glue or EMR to use Glue as the Hive Metastore? If so, some examples would be appreciated.

We set this conf when running Iceberg: "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory". Is there a JAR providing this class that I can add to any Spark application so that it can connect to AWS Glue as the Hive Metastore, or does this only work on EMR?

Thomas
asked a year ago, 298 views
1 Answer

Unfortunately, setting that configuration and adding the library is not enough: the Spark distribution itself needs to be patched to work with the Glue catalog client.
You can build that distribution yourself by following the instructions here: https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore. However, it is much easier to extract the patched jars from EMR.
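
For reference, here is a minimal sketch of how the settings could be assembled once a patched distribution is in place. It does not make an unpatched Spark work. The factory class is the one quoted in the question, and spark.sql.catalogImplementation is the standard Spark setting that enables Hive catalog support. The helper names glue_hive_conf and to_submit_args are hypothetical and exist only for illustration.

```python
# Factory class quoted in the question. It only resolves when the Glue
# catalog client classes are present in a patched Spark distribution.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_hive_conf():
    """Return the Spark settings that point the Hive catalog at Glue.

    Hypothetical helper: it only builds a dict and does not start Spark.
    """
    return {
        # Standard Spark option that turns on Hive catalog support.
        "spark.sql.catalogImplementation": "hive",
        # Setting from the question; swaps in the Glue metastore client.
        "spark.hadoop.hive.metastore.client.factory.class": GLUE_FACTORY,
    }


def to_submit_args(conf):
    """Render a settings dict as spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command.
    print(" ".join(to_submit_args(glue_hive_conf())))
```

The same key/value pairs can be passed one at a time to SparkSession.builder.config(key, value) in PySpark. AWS credentials and a region must also be available to the process; how they are supplied is outside this sketch.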

AWS EXPERT
answered a year ago
  • I have a similar situation with Hudi tables. How do you extract the patched jars from EMR? Can you link to the documentation?
