1 Answer
- How Glue detects changes: AWS Glue jobs read the source data directly during ETL execution, so they can detect discrepancies between the actual data structure and the catalog metadata even when no crawler runs.
- DynamicFrame behavior: Glue's DynamicFrames are designed to infer and adapt to schema changes automatically. That flexibility is usually helpful, but in your case it can lead to unwanted modifications (see the sketch after this list).
- Default behavior: By default, Glue allows schema changes. This is intended to keep data pipelines flexible, but it may not match your requirements.
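For illustration, here is a minimal sketch (database and table names are hypothetical) of how you can take explicit control instead of letting a DynamicFrame adapt on its own, by resolving ambiguous columns against the schema already registered in the Data Catalog with resolveChoice:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read through the Data Catalog; the frame's schema is still inferred from the data.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",        # hypothetical name
    table_name="my_source_table",  # hypothetical name
)

# choice="match_catalog" casts ambiguous (choice) columns to the types recorded
# in the catalog table instead of letting them drift with the incoming data.
resolved = dyf.resolveChoice(
    choice="match_catalog",
    database="my_database",
    table_name="my_source_table",
)
```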
To address this issue, consider the following approaches:
- Enable "Job bookmarks" in your Glue job settings. This helps prevent reprocessing of already processed data.
- Adjust crawler settings to prevent schema changes. Use the "Prevent the Crawler from changing an existing schema".
- Create and add a custom classifier to your crawler for more fine-grained control over schema detection.
- In your ETL scripts, use the enableUpdateCatalog and updateBehavior options to control catalog update behavior.
- Implement a separate monitoring system to detect schema changes and send alerts. You can use AWS CloudWatch or Amazon EventBridge for this purpose.
- Consider using the solution outlined in the AWS Big Data Blog, which involves creating an AWS Glue ETL job to compare table schema versions and notify changes via Amazon SNS.
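A sketch of point 4 (bucket, database, and table names are hypothetical): with enableUpdateCatalog=True and updateBehavior="LOG", the job is intended to leave the existing table schema in the Data Catalog unchanged, whereas "UPDATE_IN_DATABASE" would rewrite it; leaving enableUpdateCatalog off means the catalog is not touched at all.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Source frame for the example; in a real job this would be the output of your transforms.
frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",        # hypothetical name
    table_name="my_source_table",  # hypothetical name
)

# Write with explicit catalog-update behavior: "LOG" keeps the catalog schema
# as-is, "UPDATE_IN_DATABASE" would propagate schema changes to the catalog.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-bucket/output/",  # hypothetical path
    enableUpdateCatalog=True,
    updateBehavior="LOG",
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="my_database", catalogTableName="my_target_table")
sink.writeFrame(frame)
```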
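And a sketch for point 5, using boto3 to route Glue Data Catalog table-change events to an SNS topic. The rule name and topic ARN are placeholders, and the topic's access policy must separately allow EventBridge to publish to it.

```python
import json
import boto3

events = boto3.client("events")

# Placeholder names -- substitute your own rule name and SNS topic ARN.
RULE_NAME = "glue-catalog-table-change"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:glue-schema-alerts"

# Match table-level changes emitted by the Glue Data Catalog (e.g. UpdateTable).
event_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Data Catalog Table State Change"],
}

events.put_rule(
    Name=RULE_NAME,
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

# Send matching events to the SNS topic so subscribers get notified.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "sns-schema-alert", "Arn": SNS_TOPIC_ARN}],
)
```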
By implementing one or more of these strategies, you can better control schema changes, prevent unexpected modifications to your Redshift tables, and receive appropriate notifications when changes occur.
Please reply if this answer does not match the behavior you are seeing.
answered a month ago
This is the exact same GenAI answer I got from Amazon Q. If you read carefully, most of it is not relevant to the question: using bookmarks won't prevent Glue from altering the tables, and neither will a custom classifier. Also, I specifically stated that I'm not running a crawler on a schedule, only once for catalog purposes. I'd love a link to the AWS Big Data Blog post mentioned in point 6.