
New column added to S3 dataset is visible in Athena but not in a Glue job when using create_dynamic_frame.from_catalog...


I have an S3 dataset that is cataloged in Glue. A new column was added to the S3 data, and I re-ran the crawler, which updated the Glue table definition to include the new column. The new column is visible in the Athena console and can be queried. However, it is not visible in a Glue job that reads the table with glueContext.create_dynamic_frame.from_catalog. I even added the option "refreshSchema": "true" to force a refresh of the schema cache, but the column still does not appear in the Glue job. What could be the issue?

asked 16 days ago, 28 views
1 Answer

This issue could be related to how AWS Glue caches schema information and how the job interacts with the Data Catalog. Even though you re-ran the crawler and the Glue table definition now includes the new column, a delay or caching issue might be preventing the Glue job from seeing it.
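Before trying workarounds, it may help to confirm what the Data Catalog itself reports, independently of Athena. The following is a minimal sketch, assuming boto3 is available and the caller has Glue read permissions; the helper names (`column_names`, `partitions_missing_column`, `check_catalog`) are hypothetical and the database, table, and column names are placeholders. It checks whether the column appears in the table-level schema and lists any partitions whose own column definitions do not include it.

```python
from typing import Any, Dict, List, Tuple


def column_names(storage_descriptor: Dict[str, Any]) -> List[str]:
    """Return lower-cased column names from a Glue StorageDescriptor dict."""
    return [c["Name"].lower() for c in storage_descriptor.get("Columns", [])]


def partitions_missing_column(
    partitions: List[Dict[str, Any]], column: str
) -> List[List[str]]:
    """Return the Values of every partition whose own schema lacks `column`."""
    wanted = column.lower()
    return [
        p["Values"]
        for p in partitions
        if wanted not in column_names(p.get("StorageDescriptor", {}))
    ]


def check_catalog(database: str, table: str, column: str) -> Tuple[bool, List[List[str]]]:
    """Hypothetical helper: query the Data Catalog for table and partition schemas."""
    import boto3  # imported lazily so the pure helpers above work without it

    glue = boto3.client("glue")
    tbl = glue.get_table(DatabaseName=database, Name=table)["Table"]
    in_table = column.lower() in column_names(tbl["StorageDescriptor"])

    partitions: List[Dict[str, Any]] = []
    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        partitions.extend(page["Partitions"])

    return in_table, partitions_missing_column(partitions, column)
```

Calling something like `check_catalog("my_db", "my_table", "new_col")` (placeholder names) returns whether the table-level schema has the column, plus the partitions that still lack it. If the table has the column but older partitions do not, that points toward the partition metadata rather than a cache.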

Here are a few things you can try to resolve this issue:

  1. Ensure that the Glue job is using the most up-to-date version of the table metadata. Sometimes, there can be a delay in propagating changes across AWS services.

  2. Start a new job run instead of reusing an existing session or run. A fresh run might pick up the updated schema.

  3. Instead of relying on "refreshSchema": "true", you could try to explicitly specify the schema in your Glue job. This would force the job to use the schema you provide rather than a cached version.

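  4. Check whether your data is partitioned. If it is, you might need to update the partition metadata separately, because each partition in the Data Catalog carries its own column definitions, which can lag behind the table-level schema.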

  5. Verify that the IAM role associated with your Glue job has the necessary permissions to access the updated table metadata in the Glue Data Catalog.

  6. As a workaround, you could try reading the data directly from S3 using create_dynamic_frame.from_options() instead of from_catalog(). This bypasses the Glue Data Catalog, so the schema is inferred from the data itself.

  7. If none of the above works, you might want to consider dropping and recreating the table in the Glue Data Catalog, then re-running the crawler. This should ensure that the table definition is completely refreshed.
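The workaround in step 6 could look roughly like the sketch below. It assumes the data is Parquet and lives under a single S3 prefix; the path, the helper names (`s3_read_kwargs`, `read_from_s3`), and the default format are illustrative assumptions, not details from the question. The `glue_context` argument is expected to be the job's existing GlueContext.

```python
from typing import Any, Dict


def s3_read_kwargs(path: str, data_format: str = "parquet") -> Dict[str, Any]:
    """Build keyword arguments for create_dynamic_frame.from_options (S3 source)."""
    return {
        "connection_type": "s3",
        # "recurse" asks Glue to read files in all subfolders of the path
        "connection_options": {"paths": [path], "recurse": True},
        "format": data_format,
    }


def read_from_s3(glue_context, path: str, data_format: str = "parquet"):
    """Read directly from S3, bypassing the Data Catalog, and print the schema.

    Hypothetical helper: `glue_context` is the GlueContext already created
    in the job script.
    """
    dyf = glue_context.create_dynamic_frame.from_options(
        **s3_read_kwargs(path, data_format)
    )
    dyf.printSchema()  # shows the schema Glue derived from the files themselves
    return dyf
```

Inside a job this might be called as `read_from_s3(glueContext, "s3://example-bucket/my-dataset/")`, where the bucket and prefix are placeholders. If the new column shows up here but not with from_catalog(), the catalog metadata (table or partition level) is the more likely culprit than the data.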

If the problem persists after trying these solutions, it might be worth contacting AWS support for further assistance, as there could be an underlying issue with the Glue service or how it's interacting with your specific data setup.
Sources
Community | Unleashing Data Analytics on S3 Data lake with AWS Glue Crawler and Amazon Athena
Using crawlers to populate the Data Catalog - AWS Glue

answered 16 days ago
