Improve COPY ingestion performance for large data loads on Amazon Redshift

3 minute read
Content level: Expert
0

Amazon Redshift tables with a Sort Key defined may encounter COPY command execution times grow non-linearly relative to the data size. This article proposes alternative to improve the ingestion performance after the Amazon Redshift best practices for loading data have been already followed.

Introduction

Amazon Redshift is an MPP, massively parallel processing architecture. When you create a table and load data into it, Redshift distributes the table data amongst the various compute nodes based on the Distribution Style. When data is being written to disk, it needs to be sorted on each compute node based on the Sort Key defined. You might not have manually set the distribution style or the sort key, and automatic table optimization (ATO) might have identified one for you.

Problem Statement

Data loads into Amazon Redshift tables that have a Sort Key are resource intensive, requiring the system to consume large amounts of memory to sort the data as part of the COPY operation. For very large loads, you may observe the COPY commands performance deteriorate as the load data size increases, due to the larger amounts of memory needed to sort the larger data sets. And instead of single large multi-TB operations, breaking down the load into chunks could yield faster overall ingestion.

Note: It is expected that you have reviewed the Amazon Redshift best practices for loading data and followed these recommendations already.

Solution

The ManifestGenerator utility script, https://github.com/awslabs/amazon-redshift-utils/tree/master/src/ManifestGenerator, a part of the AWS Labs amazon-redshift-utils repo, can be leveraged to generate Redshift manifest files to be used for COPY command to split the ingestion into multiple batches. Note that this utility does not determine the optimal number of split batches, and you will need to provide the input num for the utility to provide you those many manifest files. It will take into account the total size of data being ingested, not the number of files being ingested, and will try to split the ingestion volume equally amongst the individual batches.

Results

When testing the COPY performance for a 36 TB data load of 334 Billion rows into Redshift table with single column used for distribution key as well as sort key, we found that a single copy operation took ~48 hours, running two sequential copy commands took ~35 hours, running four sequential copy commands took ~27 hours, and running eight sequential copy commands took ~26 hours. Note that COPY commands to the same target table get serialized anyways, and thus running multiple COPY commands against same table will not yield any faster performance.

Call to Action

For loading large data sets you can benchmark with the single COPY command, and use this utility to experiment with 2/4/8/n batches. Compare the total time and determine the right number of batches that meet your performance needs.

profile pictureAWS
EXPERT
published a month ago646 views