How do I efficiently load data from Amazon S3 into Amazon Redshift?

3 minute read
0

I want to load large volumes of data from Amazon Simple Storage Service (Amazon S3) into Amazon Redshift and maintain optimal performance.

Short description

To add data from Amazon S3 to your Amazon Redshift tables, You can use the INSERT or COPY command. However, It's a best practice to use the COPY command because it's more efficient. COPY uses massively parallel processing (MPP) architecture to simultaneously load data from files in an S3 bucket.

For other methods to use, see Loading data in Amazon Redshift.

Resolution

Run one COPY command

Run only one COPY command to load one table. If you use multiple concurrent COPY commands to load one table from multiple files, then Amazon Redshift performs a serialized load and might slow down.

Prepare data files

To efficiently use the COPY command, you must also prepare the data files. The number of files and their size affects parallel processing capabilities.

If you specify multiple files in the COPY command, then Amazon Redshift loads the data in parallel. To prepare your data files, make sure that they are approximately the same size, with a range from 1 MB to 1 GB.

For a Redshift provisioned cluster, it's a best practice to use a number of files that's a multiple of the number of slices. For Amazon Redshift Serverless, it's a best practice to use a number files that's a multiple of 128. For more information, see Loading data files.

To get the number of slices for a Redshift provisioned cluster, run the following query:

select count(*) from STV_SLICES where type = 'D';

Note: When you specify one uncompressed .csv, Parquet, or ORC file in the COPY command, Amazon Redshift splits files that are larger than 128 MB.

Monitor the COPY command status

To view the COPY command status, you can check the system tables.

The system tables record the COPY command details, including the number of files, the file size, and the number of loaded rows. You can monitor the tables to determine performance improvements that you can make and the cause of a COPY command failure. 

For more information about the COPY command, see SYS_LOAD_HISTORY and SYS_LOAD_DETAIL.

For more information about COPY command errors, see SYS_LOAD_ERROR_DETAIL.

Related information

Amazon Redshift best practices for loading data

Loading tables with the COPY command

Loading data from compressed and uncompressed files

AWS OFFICIAL
AWS OFFICIALUpdated 12 days ago