EMR Serverless not populating AWS Glue Catalog


I want to use the AWS Glue Data Catalog as a metastore. I'm running an EMR Serverless job that inserts and updates data in a Delta table. I've successfully populated Delta tables on my local machine, and now I'm trying to populate the AWS Glue Data Catalog through my EMR Serverless job. The job currently runs without error; the only problem is that the Glue Data Catalog is not getting populated.

I've followed the instructions in the EMR Serverless metastore configuration documentation.

I start my EMR Serverless job via the AWS CLI. As directed in that documentation, I add the following Spark configuration parameter:

--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
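
For context, the full CLI invocation looks roughly like this (the application ID, role ARN, and script location are placeholders, not my real values):

aws emr-serverless start-job-run \
    --application-id <application-id> \
    --execution-role-arn <job-execution-role-arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://my_bucket/scripts/delta_job.py",
            "sparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
    }'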

I've also added glue:* permissions to the IAM role that executes my EMR Serverless job (policy snippet below). I've checked the AWS Glue console, but I don't see the table under Data Catalog tables. The Spark driver logs for the job (specifically the standard error logs) don't show anything related to Glue. The only Hive-related log line I see is:

INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.

which doesn't look too promising.
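
For reference, the Glue permissions I attached to the job execution role look like this (deliberately broad for testing; a real policy should scope the actions and resources down):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:*",
            "Resource": "*"
        }
    ]
}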

However, the EMR Serverless AWS console page for the job shows it recognizes AWS Glue Data Catalog as metastore in the Metastore configuration section.


So am I doing something wrong or missing something?

asked 12 days ago · 100 views
2 Answers
Accepted Answer

I resolved the issue. Unfortunately, the AWS documentation is missing a configuration setting. The metastore configuration documentation is here:

https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/metastore-config.html#glue-metastore

In the Spark tab, the documentation is missing the following configuration setting. Without it, the catalog implementation defaults to the in-memory Derby database rather than Hive, so nothing ever reaches the Glue Data Catalog:

--conf spark.sql.catalogImplementation=hive

Combined with the other settings from the documentation, the Delta table now shows up in the AWS Glue Data Catalog. The Glue Data Catalog seems to have trouble parsing the schema correctly, but that is a question for another post.
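
In case it helps anyone verify the fix, you can check which catalog implementation is active from inside the job (a quick sanity-check sketch):

# With the fix applied this should print 'hive'; without it,
# Spark silently falls back to 'in-memory'.
print(spark.conf.get("spark.sql.catalogImplementation"))

# With Glue configured as the metastore, the databases listed here
# should match what the AWS Glue console shows.
spark.sql("SHOW DATABASES").show()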

Please update the AWS EMR Serverless documentation to include this configuration setting. Thanks!

answered 10 days ago
reviewed by AWS SUPPORT ENGINEER 9 days ago
  • That's a good catch. catalogImplementation should not be "in-memory" in this case.


Hello,

Since you are updating a Delta table that uses the Glue catalog, may I ask you to test the sample below and let me know the outcome?

  1. Configure your Spark session

Set up the Spark SQL extensions to use Delta Lake:

%%configure -f
{
    "conf": {
        "spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
}
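
If you run this as a batch job rather than from a notebook, the same settings from the %%configure block above can be passed as sparkSubmitParameters instead (a sketch of just the conf flags):

--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory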

  2. Create a Delta Lake table

We will create a Spark DataFrame with sample data and write it to a Delta Lake table. NOTE: You will need to replace my_bucket in the code below with your own bucket. Please make sure you have read and write permissions for this bucket.

tableName = "delta_table"
basePath = "s3://my_bucket/aws_workshop/delta_data_location/" + tableName

data = spark.createDataFrame([
 ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
 ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
 ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
 ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
],["id", "creation_date", "last_update_time"])

data.write.format("delta"). \
  save(basePath)
  1. Query the table

We will read the table into a Spark DataFrame using spark.read:

df = spark.read.format("delta").load(basePath)
df.show()
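
One caveat worth noting: save(basePath) writes a path-based Delta table, which by itself does not create a metastore entry. For the table to appear in the Glue Data Catalog, it also needs to be registered, for example with saveAsTable (a sketch; the database my_db is a placeholder and must already exist):

# Register the Delta table in the metastore (Glue, when configured) so it
# appears in the Data Catalog. 'my_db' is a placeholder database; create it
# first, e.g. spark.sql("CREATE DATABASE IF NOT EXISTS my_db").
data.write.format("delta") \
    .mode("overwrite") \
    .option("path", basePath) \
    .saveAsTable("my_db." + tableName)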
answered 12 days ago by AWS SUPPORT ENGINEER
  • Hi Yokesh,

    Thanks for the reply. Let me clarify my issue.

    In the past, I have been able to save and load Delta tables both on EMR Serverless and on localhost. My issue is that when I added the AWS Glue Data Catalog as a metastore (by specifying the Spark configuration parameters), the Data Catalog tables are not populated in AWS Glue, even though the EMR Serverless job still runs fine.

    The suggestions above work for me; my code is very similar to this. But again, the Glue Data Catalog is not updated.

  • Hi there, it looks strange. :-) If EMR Serverless is able to write and read the Delta table without any issues, then the metadata should be persisted in the Glue catalog. Just to confirm, in case you have not already verified it: could you please make sure the database referenced by the table exists and that you have the appropriate permissions to list it on your end? You can run the commands below to check whether they are visible:

    spark.sql('show databases').show()
    spark.sql('show tables from <Your database>').show()
    

    If they are visible here, the issue likely lies in how you reference the object. If not, you can try enabling debug logging on your Spark job and make sure it is writing to and reading from the appropriate table. If you still see the issue after checking the above pointers, please feel free to reach us via AWS Support for more assistance.
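
    To enable more verbose logging from inside the job, one minimal option is the following (a sketch; this raises verbosity for the whole driver):

    # Raise the driver's log level so catalog/metastore activity shows up
    # in the logs; remove this once you are done debugging.
    spark.sparkContext.setLogLevel("DEBUG")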
