I resolved the issue. Unfortunately, the AWS documentation is missing a configuration setting. The metastore configuration documentation is here:
https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/metastore-config.html#glue-metastore
In the Spark tab, the documentation is missing the following configuration setting. Without it, Spark defaults to the in-memory Derby database for the catalog implementation instead of using the Glue Data Catalog:
--conf spark.sql.catalogImplementation=hive
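For reference, here is a sketch of what the combined Spark properties look like once this setting is added to the ones the documentation does list (the values below are taken from the documentation and this thread; treat this as illustrative and adjust the jar paths to your release):
--conf spark.sql.catalogImplementation=hive
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar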
With the full set of settings in place, I now see the Delta table show up in the AWS Glue Data Catalog. The Glue Data Catalog seems to have trouble correctly parsing the schema, but that is a question for another post.
Please update the AWS EMR Serverless documentation to include this configuration setting. Thanks!
Hello,
Since you are updating a Delta table that uses the Glue catalog, could you please test the sample below and let me know the outcome?
- Configure your Spark session
Set up the Spark SQL extensions to use Delta Lake.
%%configure -f
{
    "conf": {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
}
- Create a Delta Lake table
We will create a Spark DataFrame with sample data and write it to a Delta Lake table.
NOTE: You will need to update my_bucket in the code below to your own bucket. Please make sure you have read and write permissions for this bucket.
# Table name and S3 location; replace my_bucket with your own bucket
tableName = "delta_table"
basePath = "s3://my_bucket/aws_workshop/delta_data_location/" + tableName

# Sample data: (id, creation_date, last_update_time)
data = spark.createDataFrame([
    ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
    ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
    ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
    ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
], ["id", "creation_date", "last_update_time"])

# Write the DataFrame out as a Delta table at basePath
data.write.format("delta").save(basePath)
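Note that save(basePath) only writes the data and Delta log to S3; for the table to show up in the Glue Data Catalog it also needs to be registered there. A minimal sketch of that step, where my_database is a placeholder name for a database in your catalog:
# Hypothetical database name; adjust to your environment
spark.sql("CREATE DATABASE IF NOT EXISTS my_database")

# Register the existing Delta files at basePath as a catalog table
spark.sql(
    "CREATE TABLE IF NOT EXISTS my_database.delta_table "
    "USING DELTA LOCATION 's3://my_bucket/aws_workshop/delta_data_location/delta_table'"
)
Alternatively, data.write.format("delta").saveAsTable("my_database.delta_table") writes the data and registers the table in one step.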
- Query the table
We will read the table with spark.read into a Spark DataFrame.
df = spark.read.format("delta").load(basePath)
df.show()
Hi Yokesh,
Thanks for the reply. Let me clarify my issue.
In the past, I have been able to save and load Delta tables both on EMR Serverless and on localhost. My issue is that after I added the AWS Glue Data Catalog as a metastore (by specifying the Spark configuration parameters), the catalog tables are not populated in AWS Glue, even though the EMR Serverless job still runs fine.
The suggestions above work for me; my code is very similar. But again, the Glue Data Catalog is not updated.
Hi there, that looks strange :-) If EMR Serverless is able to write and read the Delta table without any issues, then the metadata should be persisted in the Glue catalog. Just to confirm, in case you have not already verified it: please make sure the database referenced by the table exists and that you have the appropriate permissions to list it. You can run the commands below to check whether they are visible:
spark.sql('show databases').show()
spark.sql('show tables from <Your database>').show()
If they are visible here, the issue likely lies in how you reference the object. If not, you can try enabling debug logging on your Spark job and verify that it is writing to and reading from the appropriate table. If the issue persists after checking the pointers above, please feel free to reach out to us via AWS Support for more assistance.
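For example, one simple way to raise the log verbosity from inside the job (a sketch; this affects the whole Spark context and full debug output can be very noisy):
# Increase Spark log verbosity to help trace metastore/catalog calls
spark.sparkContext.setLogLevel("DEBUG")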
That's a good catch. catalogImplementation should not be "in-memory" in this case.