EMR Serverless not populating AWS Glue Catalog


I want to use the AWS Glue Data Catalog as a metastore. I'm running an EMR Serverless job that inserts and updates data in a Delta table. I've successfully populated Delta tables on my local machine, and I'm now trying to populate the AWS Glue Data Catalog through my EMR Serverless job. The job currently runs without error; the only problem is that the Glue Data Catalog is not getting populated.

I've followed the instructions in the EMR Serverless metastore configuration documentation (https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/metastore-config.html#glue-metastore).

I start my EMR Serverless job via the AWS CLI. I add the following Spark configuration parameter, as directed in the documentation above:

--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
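
For illustration, here is a minimal sketch of the equivalent job submission through boto3 rather than the raw CLI; the application ID, role ARN, and entry point script below are hypothetical placeholders:

import boto3

client = boto3.client("emr-serverless")

# Hypothetical application ID, role ARN, and script location -- replace with your own.
response = client.start_job_run(
    applicationId="00example1234567",
    executionRoleArn="arn:aws:iam::123456789012:role/my-emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my_bucket/scripts/delta_job.py",
            "sparkSubmitParameters": (
                "--conf spark.hadoop.hive.metastore.client.factory.class="
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            ),
        }
    },
)
print(response["jobRunId"])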

I've also added glue:* permissions to the role that executes my EMR Serverless job. I've checked the AWS Glue console, but I don't see the table under Data Catalog tables. The Spark driver logs for my EMR Serverless job (specifically the standard error logs) don't show anything regarding Glue. The only Hive-related log line I see is:

INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.

which doesn't look too promising.

However, the EMR Serverless console page for the job shows that it recognizes the AWS Glue Data Catalog as the metastore in the Metastore configuration section.


So am I doing something wrong or missing something?

Asked 2 months ago · 145 views
2 Answers

Hello,

Since you are updating a Delta table that uses the Glue catalog, could you test the sample below and let me know the outcome?

  1. Configure your Spark session.

Set up the Spark SQL extensions to use Delta Lake.

%%configure -f
{
    "conf": {
        "spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
}
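
If you are running this as a batch script rather than in a notebook (where the %%configure magic is unavailable), the same settings can be applied when building the Spark session. A minimal sketch, assuming the Delta jars listed above are available on the cluster:

from pyspark.sql import SparkSession

# Script-mode equivalent of the %%configure block above.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()  # sets spark.sql.catalogImplementation=hive
    .getOrCreate()
)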

  2. Create a Delta Lake table

We will create a Spark DataFrame with sample data and write it into a Delta Lake table. NOTE: You will need to change my_bucket in the code below to your own bucket. Please make sure you have read and write permissions for this bucket.

tableName = "delta_table"
basePath = "s3://my_bucket/aws_workshop/delta_data_location/" + tableName

data = spark.createDataFrame([
 ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
 ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
 ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
 ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
],["id", "creation_date", "last_update_time"])

data.write.format("delta"). \
  save(basePath)
  1. Query the table

We will read the table into a Spark DataFrame using spark.read.

df = spark.read.format("delta").load(basePath)
df.show()
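
Note that a path-based save() only writes the Delta files to S3; it does not by itself register anything in the metastore. For the table to appear in the catalog (Glue, once it is configured as the metastore), the write has to go through a table name, for example with saveAsTable. A minimal sketch, where the database name is a hypothetical placeholder:

# Create the target database in the catalog if it does not exist.
spark.sql("CREATE DATABASE IF NOT EXISTS my_delta_db")

# saveAsTable registers the table in the metastore in addition to writing the data.
data.write.format("delta") \
    .mode("overwrite") \
    .option("path", basePath) \
    .saveAsTable("my_delta_db.delta_table")

# Confirm the table is visible through the catalog.
spark.sql("SHOW TABLES IN my_delta_db").show()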
AWS Support Engineer
Answered 2 months ago
  • Hi Yokesh,

    Thanks for the reply. Let me clarify my issue.

    In the past, I have been able to save and load Delta tables on EMR Serverless and on localhost. My issue is that when I added the AWS Glue Data Catalog as a metastore (by specifying the Spark configuration parameters), the Data Catalog tables are not populated in AWS Glue, even though the EMR Serverless job still runs fine.

    The suggestions above work for me; my code is very similar to this. But again, the Glue Data Catalog is not updated.

  • Hi there, that looks strange :-) If EMR Serverless is able to write and read the Delta table without any issues, then the metadata should be persisted in the Glue catalog. Just to confirm, in case you have not verified this already: please make sure the database referenced by the table exists and that your role has the appropriate permission to list it. You can run the commands below to check whether they are visible:

    spark.sql('show databases').show()
    spark.sql('show tables from <Your database>').show()
    

    If they are visible here, the issue likely lies in how you reference the object. If not, you can try enabling debug logging on your Spark job (as shown below) and verify that it is writing to and reading from the expected table. If you still see the issue after checking the pointers above, please feel free to reach out via AWS Support for more assistance.
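
    A minimal way to raise the log verbosity from inside the job itself (note this affects the whole Spark context):

    spark.sparkContext.setLogLevel("DEBUG")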

Accepted Answer

I resolved the issue. Unfortunately, the AWS documentation is missing a configuration setting. The metastore configuration documentation is here:

https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/metastore-config.html#glue-metastore

In the Spark tab, the documentation is missing the following configuration setting, which is needed to keep the catalog implementation from defaulting to the in-memory Derby database when it is absent:

--conf spark.sql.catalogImplementation=hive

Combined with the other settings from the documentation, the Delta table now shows up in the AWS Glue Data Catalog. Glue seems to have trouble correctly parsing the schema, but that is a question for another post.
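
A quick way to confirm the setting has taken effect is to read it back from the running session. A minimal check:

# Should print 'hive' with the fix in place; 'in-memory' means Spark is still
# using the default catalog implementation and Glue will not be populated.
print(spark.conf.get("spark.sql.catalogImplementation"))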

Please update the AWS EMR Serverless documentation to include this configuration setting. Thanks!

Answered 2 months ago
Reviewed by an AWS Support Engineer 2 months ago
  • That's a good catch. catalogImplementation should not be "in-memory" in this case.
