For Amazon Redshift tables with a Sort Key defined, COPY command execution times may grow non-linearly with data size. This article proposes an alternative approach to improve ingestion performance after the Amazon Redshift best practices for loading data have already been followed.
Introduction
Amazon Redshift uses a massively parallel processing (MPP) architecture. When you create a table and load data into it, Redshift distributes the table data amongst the various compute nodes based on the Distribution Style. When data is written to disk, it is sorted on each compute node based on the Sort Key defined. You might not have manually set the distribution style or the sort key, and automatic table optimization (ATO) might have identified one for you.
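For illustration, a table whose distribution style and sort key are set explicitly might be declared as follows (the table and column names are hypothetical):

```sql
-- Hypothetical fact table: Redshift distributes rows across compute
-- nodes by event_id (DISTKEY) and keeps each node's slice of the data
-- sorted by event_id (SORTKEY) as it is written to disk.
CREATE TABLE clickstream_events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    user_id    BIGINT,
    payload    VARCHAR(1024)
)
DISTSTYLE KEY
DISTKEY (event_id)
SORTKEY (event_id);
```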
Problem Statement
Data loads into Amazon Redshift tables that have a Sort Key are resource intensive, requiring the system to consume large amounts of memory to sort the data as part of the COPY operation. For very large loads, you may observe COPY command performance deteriorate as the load data size increases, because more memory is needed to sort the larger data sets. Instead of a single multi-TB operation, breaking the load into smaller chunks can yield faster overall ingestion.
Note: This article assumes you have already reviewed the Amazon Redshift best practices for loading data and followed those recommendations.
Solution
The ManifestGenerator utility script (https://github.com/awslabs/amazon-redshift-utils/tree/master/src/ManifestGenerator), part of the AWS Labs amazon-redshift-utils repository, can be used to generate Redshift manifest files for the COPY command, splitting the ingestion into multiple batches. Note that this utility does not determine the optimal number of batches; you provide the desired number as input, and it produces that many manifest files. It takes into account the total size of the data being ingested, not the number of files, and tries to split the ingestion volume equally amongst the individual batches.
Results
When testing COPY performance for a 36 TB data load of 334 billion rows into a Redshift table with a single column serving as both the distribution key and the sort key, we found that a single COPY operation took ~48 hours, two sequential COPY commands took ~35 hours, four sequential COPY commands took ~27 hours, and eight sequential COPY commands took ~26 hours. Note that COPY commands against the same target table are serialized anyway, so running multiple COPY commands concurrently against the same table will not yield any faster performance.
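For example, a multi-batch load issues its COPY commands one after another, each referencing its own manifest (names are the same hypothetical placeholders as in the earlier sketch):

```sql
-- Run the batches sequentially. Redshift serializes COPY commands
-- against the same target table, so issuing them concurrently
-- would not finish any sooner.
COPY my_schema.my_table
FROM 's3://my-bucket/manifests/batch_1.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP MANIFEST;

COPY my_schema.my_table
FROM 's3://my-bucket/manifests/batch_2.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP MANIFEST;

-- ...repeat for the remaining manifest files.
```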
Call to Action
For loading large data sets, first benchmark with a single COPY command, then use this utility to experiment with 2, 4, 8, or n batches. Compare the total load times and determine the number of batches that meets your performance needs.