AWS DMS configuration to update AWS Glue Catalog with schema changes


Hello,

We set up AWS DMS with MS SQL Server 2019 as the source and S3 (Parquet) as the target, and we are configuring CDC replication. It is important for us to verify that DDL changes on the source are handled as well:

  1. DMS creates new files with the new structure (disclaimer: it does);
  2. we can track the changes in the Glue Catalog.

If we do not use the "GlueCatalogGeneration": true flag, everything works well at the file level: after the schema changes in the source, new files in S3 are written with the new structure. Adding columns works especially well, because we can simply run the crawler on a schedule and the schema in our catalog gets updated.

Dropping a column is less rosy. The column is absent from the new files, so DMS does its job here too. But how do we make the catalog aware that the schema has changed? Files with the previous schema may still be in the data folder, and the crawler simply ignores the change (correct me if the crawler can somehow be configured to analyze only new files rather than everything in the folder). The situation is roughly the same when a column's data type changes: DMS generates the files correctly, but the crawler again ignores the change because the folder also contains files with the previous type. Of course, this raises the question of how to read files with different data types for the same column, but that is not our concern right now; the first task is to catch schema changes.
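To make "catching schema changes" concrete, here is a minimal sketch of the comparison we have in mind. It assumes the column names and types of an older and a newer file have already been extracted into plain dicts (for example from the Parquet footers with something like pyarrow, which is an assumption and not part of our setup). The names `SchemaDiff` and `diff_schemas` and the sample columns are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class SchemaDiff(NamedTuple):
    added: List[str]                    # columns present only in the new schema
    removed: List[str]                  # columns present only in the old schema
    retyped: List[Tuple[str, str, str]]  # (column, old_type, new_type)


def diff_schemas(old: Dict[str, str], new: Dict[str, str]) -> SchemaDiff:
    """Compare two {column_name: type_name} mappings."""
    added = sorted(col for col in new if col not in old)
    removed = sorted(col for col in old if col not in new)
    retyped = sorted(
        (col, old[col], new[col])
        for col in old
        if col in new and old[col] != new[col]
    )
    return SchemaDiff(added, removed, retyped)


# Hypothetical schemas of a file written before and after a DDL change.
old_schema = {"id": "int64", "name": "string", "price": "int32"}
new_schema = {"id": "int64", "price": "double", "created_at": "timestamp"}

diff = diff_schemas(old_schema, new_schema)
print(diff.added)    # columns added by the DDL change
print(diff.removed)  # columns dropped by the DDL change
print(diff.retyped)  # columns whose data type changed
```

A comparison like this only detects the change; applying it to the Glue Catalog is the part we are asking about.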

It seems that the "GlueCatalogGeneration": true setting should help with this. It does handle the initial creation of the table in the catalog during initialization. But DDL changes in the source break everything: as soon as the first record with a new structure has to be applied to the target, the migration task fails. I can provide more logs if you tell me which are the most relevant, since I see a whole set of only indirectly related errors, such as:

...
AWS_CPP_SDK: <EC2MetadataClient:ERROR>: Can not retrieve resource from http://xxx.xxx.xxx.xxx/latest/meta-data/placement/availability-zone (AWS_SDK_CPP :55)
...
Not retriable error: <InvalidRequestException> line 1:8: mismatched input 'DATABASE'. Expecting: 'MATERIALIZED', 'MULTI', 'OR', 'PROTECTED', 'ROLE', 'SCHEMA', 'TABLE', 'VIEW' [1001730] (anw_retry_strategy.cpp:118)
...
AWS_CPP_SDK: <EC2MetadataClient:ERROR>: Http request to retrieve credentials failed (AWS_SDK_CPP :55)
...

As a result, we currently cannot verify whether DMS itself can update the schema in AWS Glue when the "GlueCatalogGeneration": true setting is used.
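For reference, the kind of S3 target endpoint settings being discussed can be sketched roughly as follows. This is an illustration rather than our exact configuration: the bucket, folder, role ARN, and endpoint identifier are placeholders, and the helper names are hypothetical. The sketch assumes boto3 is used to create the endpoint; the boto3 call is wrapped in a function that is not invoked here, so nothing is sent to AWS.

```python
from typing import Any, Dict


def build_s3_target_settings(bucket: str, folder: str, role_arn: str,
                             glue_catalog: bool) -> Dict[str, Any]:
    """Build the S3Settings dict for a DMS S3 target that writes Parquet."""
    return {
        "BucketName": bucket,
        "BucketFolder": folder,
        "ServiceAccessRoleArn": role_arn,
        "DataFormat": "parquet",
        # The flag in question: toggle it to compare the two behaviors.
        "GlueCatalogGeneration": glue_catalog,
    }


def create_target_endpoint(endpoint_id: str,
                           settings: Dict[str, Any]) -> Dict[str, Any]:
    """Create the DMS target endpoint (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the settings helper works without it

    dms = boto3.client("dms")
    return dms.create_endpoint(
        EndpointIdentifier=endpoint_id,
        EndpointType="target",
        EngineName="s3",
        S3Settings=settings,
    )


if __name__ == "__main__":
    # Placeholder values, not our real resources.
    settings = build_s3_target_settings(
        bucket="example-bucket",
        folder="cdc",
        role_arn="arn:aws:iam::123456789012:role/example-dms-role",
        glue_catalog=True,
    )
    print(settings)
```

With `glue_catalog=False` the files are written as described above and the crawler maintains the catalog; with `glue_catalog=True` the task fails on the first DDL change in our tests.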

asked 21 days ago, 229 views
No Answers
