By using AWS re:Post, you agree to the AWS re:Post Terms of Use

AWS Glue alters Redshift tables schema without prior notice

0

I work with Amazon Web Services as a data engineer. I have created several Glue jobs that copy data from RDS Aurora MySQL to Amazon Redshift using the Glue catalog, Visual ETL, and ETL Scripts. The problem I'm facing is that if a column is added or modified in a source table, even if this change is not reflected in the catalog, the destination table in Redshift is altered, for example, adding a new column with a changed data type, or adding a new column not available in the original schema. I thought that by not crawling the source tables this issue would be avoided, but somehow Glue detects the change and applies it to the destination without prior notice. I would prefer the job to fail as an alert. How does Glue "know" that a source table has changed without running a crawler?

1 Answer
0
  1. How Glue detects changes: AWS Glue jobs read the source data directly during ETL job execution. This allows Glue to detect discrepancies between the actual data structure and the catalog metadata, even without running a crawler.
  2. Dynamic Frame feature: Glue's Dynamic Frame feature is designed to automatically handle schema changes. While this provides flexibility, it can lead to undesired modifications in your case.
  3. Default behavior: By default, Glue is configured to allow schema changes, which is intended to provide flexibility in data pipelines but may not align with your specific requirements.

To address this issue, consider the following approaches:

  1. Enable "Job bookmarks" in your Glue job settings. This helps prevent reprocessing of already processed data.
  2. Adjust crawler settings to prevent schema changes. Use the "Prevent the Crawler from changing an existing schema".
  3. Create and add a custom classifier to your crawler for more fine-grained control over schema detection.
  4. In your ETL scripts, use the enableUpdateCatalog and updateBehavior options to control catalog update behavior.
  5. Implement a separate monitoring system to detect schema changes and send alerts. You can use AWS CloudWatch or Amazon EventBridge for this purpose.
  6. Consider using the solution outlined in the AWS Big Data Blog, which involves creating an AWS Glue ETL job to compare table schema versions and notify changes via Amazon SNS.

By implementing one or more of these strategies, you can better control schema changes, prevent unexpected modifications to your Redshift tables, and receive appropriate notifications when changes occur.

Please reply if the provided answer is different from the actual function.

answered a month ago
  • This is the exact same GenIA answer I got from Amazon Q. If you read carefully, most parts are not relevant to the question. Using bookmarks won't prevent Glue from altering the tables, nor does a custom classifier. Also, I specifically stated I'm not using a crawler on a schedule. Just once for catalog purposes. I'd love a link to the AWS Big Data Blog, mentioned in point 6.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions