How can I control the number of CDC files being generated for my target S3 endpoint using AWS DMS?
I want to control the number of change data capture (CDC) files generated when I use Amazon Simple Storage Service (Amazon S3) as a target endpoint. How can I use AWS Database Migration Service (AWS DMS) to do this?
When using Amazon S3 as a target endpoint, you can use a number of parameters to control the associated file size in the target endpoint. This includes using Amazon S3 as a target endpoint for a full load and CDC, or a CDC-only AWS DMS task.
This article discusses the following extra connection attributes (ECAs). Additionally, it covers how to use them to control the volume of CDC files generated on your Amazon S3 endpoint:
cdcMaxBatchInterval - The maximum interval length condition, defined in seconds, to output a file to Amazon S3. The default value is 60 seconds.
cdcMinFileSize - The minimum file size condition, defined in KB, to output a file to Amazon S3. The default value is 32000 KB.
maxFileSize - The maximum size, in KB, of any .csv file to be created while migrating to an S3 target during full load. The default value is 1,048,576 KB (1 GB).
WriteBufferSize - The size, in KB, of the in-memory file write buffer used when generating .csv files on the local disk at the AWS DMS replication instance. The default value is 1000 KB.
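ECAs are passed to the endpoint as a single string of semicolon-separated name=value pairs. The following minimal sketch shows one way to assemble that string. The helper name build_eca is hypothetical, and the values are just the defaults listed above, not a recommendation.

```python
def build_eca(settings):
    """Join ECA name/value pairs into one semicolon-separated string.

    `build_eca` is a hypothetical helper for illustration only;
    it is not part of any AWS SDK.
    """
    return ";".join(f"{name}={value}" for name, value in settings.items())


# Example values are the defaults described above.
eca_string = build_eca({
    "cdcMaxBatchInterval": 60,   # seconds
    "cdcMinFileSize": 32000,     # KB
})
print(eca_string)
```

You can then paste the resulting string into the extra connection attributes field of the S3 target endpoint.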
The cdcMaxBatchInterval parameter controls the time interval for writing files to Amazon S3. With the default value of 60 seconds, AWS DMS writes files to Amazon S3 every minute. Another important parameter is cdcMinFileSize, which sets the minimum file size that triggers a write of the CDC file. With the default value of 32000 KB, AWS DMS writes to Amazon S3 every time it has accumulated 32000 KB of change data.
The cdcMaxBatchInterval and cdcMinFileSize parameters work together. AWS DMS writes a file as soon as either condition is met, whichever happens first. With the default setup, AWS DMS writes a file to Amazon S3 when it has either a minute of pending changes or 32000 KB of data. Note: AWS DMS keeps a transaction in a single file, so if a transaction is large, the file can be bigger than cdcMinFileSize and can take longer than cdcMaxBatchInterval to be written.
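To see how the two conditions interact, the following sketch models the "whichever is met first" rule and gives a rough estimate of how many CDC files an hour of changes would produce. It is a simplified model, not AWS DMS code: it assumes a steady change rate and ignores the large-transaction case from the note above. The function names and the change_rate_kb_per_s parameter are hypothetical.

```python
def should_write_file(elapsed_seconds, pending_kb,
                      cdc_max_batch_interval=60, cdc_min_file_size_kb=32000):
    """Return True when either write condition is met (simplified model)."""
    return (elapsed_seconds >= cdc_max_batch_interval
            or pending_kb >= cdc_min_file_size_kb)


def estimate_files_per_hour(change_rate_kb_per_s,
                            cdc_max_batch_interval=60,
                            cdc_min_file_size_kb=32000):
    """Rough file count per hour, assuming a constant change rate."""
    if change_rate_kb_per_s <= 0:
        return 0.0  # assumption: no changes means no files in this model
    # Seconds needed to accumulate cdcMinFileSize at the given rate.
    seconds_to_fill = cdc_min_file_size_kb / change_rate_kb_per_s
    # A file is written when the first of the two conditions is met.
    seconds_per_file = min(cdc_max_batch_interval, seconds_to_fill)
    return 3600 / seconds_per_file


# At 100 KB/s the 60-second interval is reached first.
print(estimate_files_per_hour(100))
# Raising cdcMaxBatchInterval to 300 seconds lowers the file count.
print(estimate_files_per_hour(100, cdc_max_batch_interval=300))
```

In this model, raising cdcMaxBatchInterval reduces the file count only while the interval is the condition that is met first. Once cdcMinFileSize is reached sooner, you must also raise cdcMinFileSize to get fewer, larger files.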
maxFileSize determines the maximum size of the S3 target output files for both CSV and Parquet formats. However, when writing .parquet files, AWS DMS writes data in batches:
1. AWS DMS allocates a memory segment of 1024 KB, which is the default size for writeBufferSize.
2. Regardless of the value of maxFileSize, AWS DMS allocates at least one write buffer with a default size of 1 MB.
3. When AWS DMS finishes writing the first batch of data, it compares the current size of data against the maxFileSize. The data is written to a .parquet file in the target S3 bucket if the current size is greater than or equal to maxFileSize.
4. If you set maxFileSize to 1 MB, then a single write buffer at the default writeBufferSize of 1 MB already meets the value of maxFileSize, so the condition is met as soon as one write buffer is allocated. If you want to decrease the overall size of the generated .parquet file, you can decrease the value of writeBufferSize. When you set it to less than 1 MB, the conditional check happens when the size of the data written is less than 1 MB.
Note: The WriteBufferSize parameter settings apply only to .parquet and not to .csv files.
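The batch behavior for .parquet files can be illustrated with a small simulation. This is a sketch of the steps described above, not the actual AWS DMS implementation: it assumes the size check happens only after each full write buffer, and it uses 1024 KB as the default buffer size. The function name simulate_parquet_write is hypothetical.

```python
def simulate_parquet_write(max_file_size_kb, write_buffer_size_kb=1024):
    """Return (buffers_used, file_size_kb) under the simplified model.

    Data is added one write buffer at a time, and the size is compared
    with maxFileSize only after each buffer has been written.
    """
    if write_buffer_size_kb <= 0:
        raise ValueError("write_buffer_size_kb must be positive")
    written_kb = 0
    buffers_used = 0
    while True:
        written_kb += write_buffer_size_kb   # write one batch
        buffers_used += 1
        if written_kb >= max_file_size_kb:   # check after the batch
            return buffers_used, written_kb


# With the default buffer, a 500 KB limit still yields a file of about 1024 KB.
print(simulate_parquet_write(500))
# A smaller buffer lets the check trigger closer to the limit.
print(simulate_parquet_write(500, write_buffer_size_kb=256))
```

In this model, the file size is always a whole multiple of the write buffer size, which is why lowering writeBufferSize is what allows .parquet files smaller than 1 MB.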