1 Answer
It sounds like you're encountering an issue with AWS Glue not properly classifying Delta tables when using EMR Serverless Spark jobs. Here are some steps and considerations to help diagnose and resolve this issue:
- Check IAM Permissions:
- Ensure that the IAM role used by your EMR Serverless Spark job has the necessary permissions to interact with AWS Glue and access the S3 location where your Delta tables are stored. The role should have permissions for Glue CreateDatabase, CreateTable, GetTable, and UpdateTable actions.
- Verify Glue Data Catalog Configuration:
- Double-check the configuration of your EMR Serverless Spark job to ensure that it is correctly configured to use the Glue Data Catalog. This includes specifying the Glue catalog ID and the appropriate database and table names when accessing Delta tables.
- Review Spark Job Code:
- Inspect your Spark job code to confirm that you are correctly specifying the Delta format when reading and writing Delta tables.
- Ensure that you are not inadvertently using a different format or incorrectly specifying the table location.
- Verify Delta Table Metadata:
- Check the metadata of your Delta tables directly in the Glue Data Catalog. You can use the AWS Glue console or CLI to inspect the table properties and ensure that they are correctly populated. Look for properties like InputFormat, OutputFormat, SerializationLibrary, and StorageDescriptor.
- Inspect Glue Data Catalog Schema:
- If Glue is not correctly inferring the schema for your Delta tables, consider explicitly specifying the schema when reading the data in your Spark job. This can help ensure that Glue recognizes the correct schema during table creation.
- Manually Trigger Crawlers:
- If Glue is unable to classify the data correctly during the EMR Serverless job, try manually triggering a Glue crawler after the job completes. This can force Glue to re-crawl the data and update its metadata catalog.
- Investigate Glue Crawler Settings:
- Compare the settings used by your EMR Serverless Spark job with those used by your manually created Glue crawler. Ensure that they are consistent in terms of data location, format, and schema inference settings.
- Monitor S3 Object Structure:
- Examine the structure of the S3 location where your Delta tables are stored. Ensure that the data files are organized in a way that Glue can efficiently parse and infer schema information.

Please let me know if this helps.
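To make the Glue Catalog and Delta configuration from the steps above concrete, here is a minimal sketch of the Spark properties an EMR Serverless job typically needs. This is an assumption-laden example: the jar paths and property values follow the EMR Serverless Delta Lake documentation, but you should verify them against your specific release label.

```python
# Sketch of Spark properties for an EMR Serverless job that writes Delta
# tables registered in the AWS Glue Data Catalog. Verify jar paths and
# property values against your EMR release; they may differ.
spark_conf = {
    # Use the Glue Data Catalog as the Hive metastore.
    "spark.hadoop.hive.metastore.client.factory.class":
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    # Enable Delta Lake SQL support and register Delta as the session catalog.
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog":
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    # Delta jars bundled with the EMR Serverless image (path may vary by release).
    "spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,"
                  "/usr/share/aws/delta/lib/delta-storage.jar",
}

# Flatten into the --conf form used in sparkSubmitParameters.
submit_params = " ".join(f"--conf {k}={v}" for k, v in spark_conf.items())
print(submit_params)
```

If any of these properties are missing, Spark may fall back to writing plain Parquet metadata, which would explain Glue misclassifying the table.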
Ismael Murillo
answered 9 months ago
Hi Ismael,
Thanks for the response. I had previously completed the first five bullet points, and I'm still getting the error. Before trying the remaining suggestions, I turned up the logging for AWS resources by setting the logging level for com.amazonaws to debug during the execution of my EMR Serverless job.
I examined the standard error log and the only unusual log I found was the following:
Could this be the root cause of my issue? Do I need to create global_temp, or should AWS create it automatically? FYI, the role that executes my EMR Serverless job has the following actions permitted against the arn:aws:glue:{self.region}:{self.account}:database/global_temp resource:
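For reference, a small sketch of how I understand the ARNs involved (region and account values here are placeholders, and the helper function is hypothetical): Spark's global temporary views live in a reserved database, global_temp by default (configurable via spark.sql.globalTempDatabase), and when Glue backs the metastore, Spark can probe for it at session start, so the role may need read access on both the catalog and that database ARN.

```python
# Hypothetical helper: build the Glue resource ARNs an EMR Serverless job
# role may need when Spark probes for the global_temp database (Spark's
# default global temp view database, spark.sql.globalTempDatabase).
# Region and account values are placeholders.
def glue_global_temp_arns(region: str, account: str,
                          temp_db: str = "global_temp") -> list:
    return [
        # Some Glue metastore calls are scoped to the catalog itself.
        f"arn:aws:glue:{region}:{account}:catalog",
        f"arn:aws:glue:{region}:{account}:database/{temp_db}",
    ]

print(glue_global_temp_arns("us-east-1", "123456789012"))
```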
Hi Ismael,
Is this the already-known issue described here?
https://repost.aws/questions/QUToXdoBgjTgiGMRFxcsY_3A/schema-incorrectly-showing-data-type-of-array-in-glue-catalog-when-using-delta-lake-table