
AWS Glue crawler creating new version glue data catalog tables when no change in schema


Hi all,

I hope you’re doing well. I have a question about AWS Glue crawlers, specifically regarding how they handle the generation of new versions for tables in the Glue Data Catalog.

Currently, our Glue crawlers are scheduled to run weekly to detect schema changes (e.g., new columns, new tables, or removed columns) in databases such as PostgreSQL and Redshift. However, we’ve noticed that the crawler creates new versions of tables in the Glue Data Catalog even when there are no actual changes to the table schema; the only difference between the versions is the table’s “last updated” property.

[Screenshot: changes when comparing versions]

For context, these are the advanced options we have configured for the crawler:

  • When the crawler detects schema changes in the data store, it is set to “Update the table definition in the data catalog.”
  • We have not enabled the option to “Update all new and existing partitions with metadata from the table.”
  • For handling deleted objects in the data store, the crawler is set to “Mark the table as deprecated in the data catalog.”

[Screenshots: advanced crawler options]

Is there a way to prevent Glue crawlers from generating new table versions unless there are actual schema changes?

Thanks in advance for your insights!

asked a year ago · 735 views
1 Answer

The behavior you're experiencing with AWS Glue crawlers creating new versions of tables in the Glue Data Catalog even when there are no schema changes is a known characteristic of how crawlers operate. By default, when a crawler runs, it will update the "last updated" timestamp of the table metadata, which results in a new version being created in the Data Catalog.

However, there are a few approaches you can consider to minimize unnecessary version creation:

  1. Use incremental crawls: Instead of running full crawls weekly, you can configure your crawler to perform incremental crawls. This way, the crawler will only process new or modified data, potentially reducing the likelihood of creating new versions when no changes have occurred.

  2. Adjust the crawler's update behavior: In the crawler's configuration, under "Catalog options," you can modify the update behavior. Instead of "Update the table definition in the data catalog," you could choose "Add new columns only." This option will only update the table if new columns are detected, which may reduce unnecessary version creation.

  3. Custom ETL job: If you need more fine-grained control, you could create a custom AWS Glue ETL job that checks for schema changes before updating the Data Catalog. This approach would allow you to implement your own logic for when to create new versions.

  4. Use the Schema Registry: For certain types of data, you might benefit from using the AWS Glue Schema Registry. This can help manage schema evolution more precisely, though it's primarily designed for streaming data.

  5. Regular cleanup: If the accumulation of versions is a concern, you could implement a process to periodically clean up old versions that are no longer needed.
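To make approach 3 concrete, here is a minimal sketch of a pre-update schema check using boto3's `get_table` and `update_table` calls (which are real Glue APIs); the database/table names, the `columns_differ` helper, and the surrounding ETL flow are assumptions for illustration:

```python
def columns_differ(current_columns, new_columns):
    """Compare (Name, Type) pairs from two Glue column lists.
    Unlike "Add new columns only", this also flags removed columns."""
    as_pairs = lambda cols: [(c["Name"], c["Type"]) for c in cols]
    return as_pairs(current_columns) != as_pairs(new_columns)


def update_table_if_changed(database, table_name, table_input):
    """Push a new table definition (and thus create a new version) only
    when the column set actually changed. Names here are hypothetical."""
    import boto3  # imported lazily so the pure helper above needs no AWS access

    glue = boto3.client("glue")
    current = glue.get_table(DatabaseName=database, Name=table_name)
    if columns_differ(
        current["Table"]["StorageDescriptor"]["Columns"],
        table_input["StorageDescriptor"]["Columns"],
    ):
        glue.update_table(DatabaseName=database, TableInput=table_input)
```

Because the comparison runs before any write to the catalog, an unchanged schema produces no new version at all, which is exactly the behavior the crawler itself does not offer.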

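For approach 2, the console's "Add new columns only" choice maps to the crawler Configuration JSON value `MergeNewColumns`, and the update/delete handling maps to `SchemaChangePolicy` in `update_crawler`. A sketch of setting this programmatically, where the crawler name is hypothetical and everything else follows the documented crawler configuration keys:

```python
import json

# Configuration JSON equivalent of the console's "Add new columns only"
config = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}
    },
}


def build_update_crawler_args(crawler_name):
    """Arguments for boto3's glue.update_crawler(); SchemaChangePolicy
    controls how schema updates and deleted objects are handled."""
    return {
        "Name": crawler_name,
        "Configuration": json.dumps(config),
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }


# glue = boto3.client("glue")
# glue.update_crawler(**build_update_crawler_args("my-weekly-crawler"))
```

Note the trade-off already mentioned above: `MergeNewColumns` will not surface removed columns, so it only fits if additions are all you need to track.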
It's important to note that while these approaches can help reduce unnecessary version creation, they may not completely eliminate it. The creation of new versions, even with only timestamp changes, is part of AWS Glue's design to maintain a history of metadata changes. This can be beneficial for auditing and tracking purposes, but it's understandable if it doesn't align with your specific use case.
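If the cleanup route (approach 5) fits your use case, the `get_table_versions` paginator and `batch_delete_table_version` are the relevant (real) Glue APIs; the retention count and `dry_run` default below are assumptions for illustration:

```python
def select_stale(version_ids, keep=5):
    """Pure helper: given version-id strings (numeric, higher = newer),
    return the ids beyond the `keep` most recent."""
    return sorted(version_ids, key=int, reverse=True)[keep:]


def prune_table_versions(database, table_name, keep=5, dry_run=True):
    """Delete all but the `keep` newest versions of one catalog table."""
    import boto3  # lazy import keeps select_stale usable without AWS access

    glue = boto3.client("glue")
    version_ids = []
    for page in glue.get_paginator("get_table_versions").paginate(
        DatabaseName=database, TableName=table_name
    ):
        version_ids.extend(v["VersionId"] for v in page["TableVersions"])
    stale = select_stale(version_ids, keep)
    if stale and not dry_run:
        # batch_delete_table_version accepts at most 100 ids per call
        for i in range(0, len(stale), 100):
            glue.batch_delete_table_version(
                DatabaseName=database,
                TableName=table_name,
                VersionIds=stale[i : i + 100],
            )
    return stale
```

Running this on a schedule (e.g., from a small Lambda after each crawl) keeps the version history bounded without touching the crawler's behavior itself.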

If this behavior significantly impacts your workflows, you may want to consider reaching out to AWS support for further guidance or to suggest this as a feature request for more granular control over version creation in the Glue Data Catalog.
Sources
AWS Glue Catalog Data Tables | AWS re:Post
CatalogSchemaChangePolicy - AWS Glue
AWS Glue Data Catalog - AWS Prescriptive Guidance
AWS Glue Data Catalog best practices - AWS Glue
Configure a crawler to handle schema changes | AWS re:Post

answered a year ago
EXPERT
reviewed a year ago
  • Incremental crawls will still create new versions, right? I am mainly using the crawler for schema tracking.

    I cannot use "Add new columns only" since I want to track table deletions as well.

    The Schema Registry does not work with the connectors I want to use.

    Any other options?
