Connecting to Glue Hive Data Catalog from EC2 or Local Computer with Spark


Hi, I built an Iceberg table that uses AWS Glue as the Hive catalog. Team members I work with want to connect to it using Spark: they either run Spark locally on their laptops and want to read the table, or they have Spark running locally in an Airflow task on an EC2 instance and want to connect from there. Is it possible to configure Spark that is not running on Glue or EMR to connect to Glue as the Hive Metastore? If so, some examples would be appreciated.

We set this conf when running Iceberg: "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory". Is this a JAR I can add to any Spark application to let it connect to AWS Glue as the Hive metastore, or does it only work on EMR?
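
For illustration, this is roughly how we pass that conf today (the database/table names below are placeholders, and the Iceberg extension setting is just our usual one):

```python
# Rough sketch of our current setup; names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-glue-read")
    # The conf mentioned above: use the Glue Data Catalog client factory
    # instead of a regular Hive metastore client
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    # Iceberg SQL extensions so Iceberg DDL/DML works in Spark SQL
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# Placeholder database/table names
spark.sql("SELECT * FROM my_db.my_iceberg_table LIMIT 10").show()
```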

Thomas
Asked 1 year ago · 334 views
1 Answer

Unfortunately, it's not enough to configure that and add the library; the Spark distribution itself needs to be patched to work with the Glue catalog client.
You can build that distribution yourself following the instructions here: https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore, but it's much easier to extract the patched jars from EMR.
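
As a rough, unverified sketch: assuming you have copied the patched jars from an EMR node into a local directory (the path below is only an example) and have AWS credentials with Glue permissions available, the local Spark session would be configured along these lines:

```python
# Rough sketch, not a verified recipe. Assumes the patched Glue Data Catalog
# client jars were extracted from an EMR node into /opt/glue-catalog-jars/
# (the directory name is an assumption; adjust to wherever you place them).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-spark-glue-catalog")
    # Put the patched jars on the driver and executor classpaths
    .config("spark.driver.extraClassPath", "/opt/glue-catalog-jars/*")
    .config("spark.executor.extraClassPath", "/opt/glue-catalog-jars/*")
    # Tell the Hive client layer to talk to the Glue Data Catalog
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# If the setup works, the databases listed here come from the Glue Data Catalog.
spark.sql("SHOW DATABASES").show()
```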

AWS
Expert
Answered 1 year ago
  • I'm having a similar situation with Hudi tables. How do you extract the patched jars from EMR? Can you link to the documentation?
