AWS Glue Data Catalog cannot determine Delta table classification


I'm running an EMR Serverless Spark job that uses Delta OSS to handle Delta tables. I previously resolved a configuration issue with EMR Serverless and AWS Glue Data Catalog here.
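For context, a setup like this is usually driven by a handful of Spark settings. The sketch below is not the asker's actual configuration. It assumes the commonly documented Delta Lake and Glue Data Catalog settings, and the bucket name and the helper `to_spark_submit_parameters` are hypothetical.

```python
# Hypothetical warehouse location: replace with your own bucket and prefix.
WAREHOUSE_DIR = "s3://example-bucket/warehouse/"

# Settings commonly documented for Delta Lake OSS on Spark with the
# Glue Data Catalog as the metastore. Treat them as a starting point.
SPARK_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    "spark.hadoop.hive.metastore.client.factory.class": (
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    ),
    "spark.sql.warehouse.dir": WAREHOUSE_DIR,
}


def to_spark_submit_parameters(conf):
    """Render a conf dict as a single string of --conf key=value pairs."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


if __name__ == "__main__":
    print(to_spark_submit_parameters(SPARK_CONF))
```

The resulting string is the kind of value that can be passed as the Spark submit parameters of an EMR Serverless job run.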

Although the appropriate Glue database and table are created, Glue does not classify the data correctly. It should classify the data as "delta", but no classification is given. The parsed schema is also effectively empty: it has the type array. I can also see that there are no objects under the spark.sql.warehouse.dir directory, although some empty S3 folders are created.
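One way to see exactly what the catalog stored is to fetch the table definition and pull out the classification, the Spark provider property, and the columns. This is only a sketch: `summarize_glue_table` and `fetch_glue_table` are hypothetical helpers, the example values are made up to resemble the symptoms above, and the `spark.sql.sources.provider` key is an assumption about where Spark records the table format.

```python
def summarize_glue_table(table):
    """Summarize the parts of a Glue table definition relevant to this issue.

    `table` is the dict found under the "Table" key of a Glue GetTable response.
    """
    params = table.get("Parameters", {})
    descriptor = table.get("StorageDescriptor", {})
    columns = [(c["Name"], c.get("Type")) for c in descriptor.get("Columns", [])]
    return {
        "classification": params.get("classification"),
        # Assumption: Spark records the data source format under this key.
        "spark_provider": params.get("spark.sql.sources.provider"),
        "columns": columns,
        "location": descriptor.get("Location"),
    }


def fetch_glue_table(database, name):
    """Fetch a table definition from Glue. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the summary helper works without it

    response = boto3.client("glue").get_table(DatabaseName=database, Name=name)
    return response["Table"]


if __name__ == "__main__":
    # Hypothetical table definition, not real output from the asker's account.
    example = {
        "Name": "example_table",
        "Parameters": {"spark.sql.sources.provider": "delta"},
        "StorageDescriptor": {
            "Columns": [{"Name": "col", "Type": "array<string>"}],
            "Location": "s3://example-bucket/warehouse/example_table",
        },
    }
    print(summarize_glue_table(example))
```

Running the same summary against the table created by the job and the table created by the crawler should make the difference between the two entries easy to compare.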

My EMR Serverless logs don't report any errors from Glue.

Interestingly, it determines the data as "delta" if I manually create my own Glue crawler to crawl the exact same data.

Why can't Glue determine that the data is "delta", and why isn't the schema recognized? Is it related to this issue here?

asked 9 days ago, 212 views
1 Answer

It sounds like you're encountering an issue with AWS Glue not properly classifying Delta tables when using EMR Serverless Spark jobs. Here are some steps and considerations to help diagnose and resolve this issue:

  • Check IAM Permissions:
      • Ensure that the IAM role used by your EMR Serverless Spark job has the necessary permissions to interact with AWS Glue and access the S3 location where your Delta tables are stored. The role should have permissions for the Glue CreateDatabase, CreateTable, GetTable, and UpdateTable actions.
  • Verify Glue Data Catalog Configuration:
      • Double-check the configuration of your EMR Serverless Spark job to ensure that it is correctly configured to use the Glue Data Catalog. This includes specifying the Glue catalog ID and the appropriate database and table names when accessing Delta tables.
  • Review Spark Job Code:
      • Inspect your Spark job code to confirm that you are correctly specifying the Delta format when reading and writing Delta tables.
      • Ensure that you are not inadvertently using a different format or incorrectly specifying the table location.
  • Verify Delta Table Metadata:
      • Check the metadata of your Delta tables directly in the Glue Data Catalog. You can use the AWS Glue console or CLI to inspect the table properties and ensure that they are correctly populated. Look for properties like InputFormat, OutputFormat, SerializationLibrary, and StorageDescriptor.
  • Inspect Glue Data Catalog Schema:
      • If Glue is not correctly inferring the schema for your Delta tables, consider explicitly specifying the schema when reading the data in your Spark job. This can help ensure that Glue recognizes the correct schema during table creation.
  • Manually Trigger Crawlers:
      • If Glue is unable to classify the data correctly during the EMR Serverless job, try manually triggering a Glue crawler after the job completes. This can force Glue to re-crawl the data and update its metadata catalog.
  • Investigate Glue Crawler Settings:
      • Compare the settings used by your EMR Serverless Spark job with those used by your manually created Glue crawler. Ensure that they are consistent in terms of data location, format, and schema inference settings.
  • Monitor S3 Object Structure:
      • Examine the structure of the S3 location where your Delta tables are stored. Ensure that the data files are organized in a way that Glue can efficiently parse and infer schema information.

Please let me know if this helps.
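As a rough sketch of the S3 structure check in the last step above: a Delta table normally keeps its transaction log in a `_delta_log/` folder next to the data files, so listing the keys under the table prefix shows quickly whether anything was committed. The helper names `find_delta_commits` and `list_keys` are hypothetical, and the bucket and prefix you pass in are your own.

```python
import re

# Delta commit files are named with a zero-padded 20-digit version, e.g.
# 00000000000000000000.json
_COMMIT_FILE = re.compile(r"^\d{20}\.json$")


def find_delta_commits(keys, table_prefix):
    """Return the commit file keys found under <table_prefix>/_delta_log/."""
    log_prefix = table_prefix.rstrip("/") + "/_delta_log/"
    return sorted(
        key
        for key in keys
        if key.startswith(log_prefix) and _COMMIT_FILE.match(key[len(log_prefix):])
    )


def list_keys(bucket, prefix):
    """List object keys under a prefix. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the pure check above works without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

An empty result from `find_delta_commits(list_keys(bucket, prefix), prefix)` would suggest that the job did not write the table to that location, which is worth comparing against the path the crawler was pointed at.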

Ismael Murillo

AWS
answered 5 days ago
  • Hi Ismael,

    Thanks for the response. I previously completed the first five bullet points. I'm still getting the error. Before I try the other bullet points, I turned up the logging for AWS resources by setting the logging level for com.amazonaws to debug during the execution of my EMR Serverless job.

    I examined the standard error log and the only unusual log I found was the following:

    DEBUG request: Received error response: com.amazonaws.services.glue.model.EntityNotFoundException: Database global_temp not found. (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException; Request ID: 
    

    Could this be the root cause of my issue? Do I need to create global_temp, or should AWS create it automatically? FYI, the role that executes my EMR Serverless job has the following actions permitted against the arn:aws:glue:{self.region}:{self.account}:database/global_temp resource:

    "glue:CreateDatabase",
    "glue:CreateTable",
    "glue:GetDatabase",
    "glue:GetTable",
    "glue:UpdateTable",
    
