EMR Serverless not populating AWS Glue Catalog


I want to use the AWS Glue Data Catalog as a metastore. I'm running an EMR Serverless job that inserts and updates data in a Delta table. I've successfully populated Delta tables on my local machine, and I'm now trying to populate the AWS Glue Data Catalog through my EMR Serverless job. The job currently runs without error; the only problem is that the Glue Data Catalog is not getting populated.

I've followed the instructions in the EMR Serverless metastore configuration documentation (https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/metastore-config.html#glue-metastore).

I start my EMR Serverless job via the AWS CLI. I add the following Spark configuration parameter, as directed in the documentation above:

--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
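
For illustration, here is a minimal sketch of the equivalent job submission through boto3 rather than the raw CLI; the application ID, role ARN, and entry point script below are hypothetical placeholders:

import boto3

client = boto3.client("emr-serverless")

# Hypothetical application ID, role ARN, and script location -- replace with your own.
response = client.start_job_run(
    applicationId="00example1234567",
    executionRoleArn="arn:aws:iam::123456789012:role/my-emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my_bucket/scripts/delta_job.py",
            "sparkSubmitParameters": (
                "--conf spark.hadoop.hive.metastore.client.factory.class="
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            ),
        }
    },
)
print(response["jobRunId"])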

I've also added glue:* permissions to the role that executes my EMR Serverless job. I've checked the AWS Glue console, but I don't see the table under Data Catalog tables. The Spark driver logs for my EMR Serverless job (specifically the standard error logs) don't show anything regarding Glue. The only Hive-related log line I see is:

INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.

which doesn't look too promising.

However, the EMR Serverless console page for the job shows that it recognizes the AWS Glue Data Catalog as the metastore in the Metastore configuration section.


So am I doing something wrong or missing something?

Asked 2 months ago · 145 views
2 Answers

Hello,

Since you are updating a Delta table that uses the Glue catalog, could you test the sample below and let me know the outcome?

  1. Configure your Spark session.

Set up the Spark SQL extensions to use Delta Lake.

%%configure -f
{
    "conf": {
        "spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
}
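
If you are running this as a batch script rather than in a notebook (where the %%configure magic is unavailable), the same settings can be applied when building the Spark session. A minimal sketch, assuming the Delta jars listed above are available on the cluster:

from pyspark.sql import SparkSession

# Script-mode equivalent of the %%configure block above.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()  # sets spark.sql.catalogImplementation=hive
    .getOrCreate()
)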

  2. Create a Delta Lake table

We will create a Spark DataFrame with sample data and write it into a Delta Lake table. NOTE: You will need to change my_bucket in the code below to your own bucket. Please make sure you have read and write permissions for this bucket.

tableName = "delta_table"
basePath = "s3://my_bucket/aws_workshop/delta_data_location/" + tableName

data = spark.createDataFrame([
 ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
 ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
 ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
 ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
],["id", "creation_date", "last_update_time"])

data.write.format("delta"). \
  save(basePath)
  1. Query the table

We will read the table into a Spark DataFrame using spark.read.

df = spark.read.format("delta").load(basePath)
df.show()
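
Note that a path-based save() only writes the Delta files to S3; it does not by itself register anything in the metastore. For the table to appear in the catalog (Glue, once it is configured as the metastore), the write has to go through a table name, for example with saveAsTable. A minimal sketch, where the database name is a hypothetical placeholder:

# Create the target database in the catalog if it does not exist.
spark.sql("CREATE DATABASE IF NOT EXISTS my_delta_db")

# saveAsTable registers the table in the metastore in addition to writing the data.
data.write.format("delta") \
    .mode("overwrite") \
    .option("path", basePath) \
    .saveAsTable("my_delta_db.delta_table")

# Confirm the table is visible through the catalog.
spark.sql("SHOW TABLES IN my_delta_db").show()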
AWS Support Engineer
Answered 2 months ago
  • Hi Yokesh,

    Thanks for the reply. Let me clarify my issue.

    In the past, I have been able to save and load Delta tables on EMR Serverless and on localhost. My issue is that when I added the AWS Glue Data Catalog as a metastore (by specifying the Spark configuration parameters), the Data Catalog tables are not populated in AWS Glue, even though the EMR Serverless job still runs fine.

    The suggestions above work for me; my code is very similar to this. But again, the Glue Data Catalog is not updated.

  • Hi there, that looks strange :-) If EMR Serverless is able to write and read the Delta table without any issues, then the metadata should be persisted in the Glue catalog. Just to confirm, in case you have not verified this already: please make sure the database referenced by the table exists and that your role has the appropriate permission to list it. You can run the commands below to check whether they are visible:

    spark.sql('show databases').show()
    spark.sql('show tables from <Your database>').show()
    

    If they are visible here, the issue likely lies in how you reference the object. If not, you can try enabling debug logging on your Spark job (as shown below) and verify that it is writing to and reading from the expected table. If you still see the issue after checking the pointers above, please feel free to reach out via AWS Support for more assistance.
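
    A minimal way to raise the log verbosity from inside the job itself (note this affects the whole Spark context):

    spark.sparkContext.setLogLevel("DEBUG")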

Accepted Answer

I resolved the issue. Unfortunately, the AWS documentation is missing a configuration setting. The metastore configuration documentation is here:

https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/metastore-config.html#glue-metastore

In the Spark tab, the documentation is missing the following configuration setting, which is needed to keep the catalog implementation from defaulting to the in-memory Derby database when it is absent:

--conf spark.sql.catalogImplementation=hive

Combined with the other settings from the documentation, the Delta table now shows up in the AWS Glue Data Catalog. Glue seems to have trouble correctly parsing the schema, but that is a question for another post.
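
A quick way to confirm the setting has taken effect is to read it back from the running session. A minimal check:

# Should print 'hive' with the fix in place; 'in-memory' means Spark is still
# using the default catalog implementation and Glue will not be populated.
print(spark.conf.get("spark.sql.catalogImplementation"))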

Please update the AWS EMR Serverless documentation to include this configuration setting. Thanks!

Answered 2 months ago
Reviewed by an AWS Support Engineer 2 months ago
  • That's a good catch. catalogImplementation should not be "in-memory" in this case.
