Redshift ran out of storage during data COPY


Customer had 5,000 Parquet/Snappy files in S3, about 500 GB in total. The COPY command errored out after 9 hours because Redshift ran out of 5 TB of space just for this table. DISTSTYLE is not specified, so AUTO should kick in and, given the table size, eventually switch to EVEN. Is Redshift uncompressing all this data inside its storage first, before applying encoding? I can't imagine why 10x the storage got chewed up.

Behram (AWS, EXPERT)
asked 3 years ago · 776 views
1 Answer
Accepted Answer

Yes, it is possible. Briefly, when using DISTSTYLE ALL (as in the initial stage of AUTO), Redshift needs to make a replica of the table on the first slice of each node (NOT on all slices), so with 2 nodes the space is 2x, and so on. And yes, during COPY, all the data files need to be uncompressed, the columns need to be encoded/compressed (using temp space), and the rows often sorted (more temp space) before they can be written out to disk. Assuming a Parquet/Snappy compression ratio of about 3, that adds an extra 3x factor, and so on.
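To see what DISTSTYLE AUTO has actually resolved to and how much space the table has consumed so far, you can query SVV_TABLE_INFO. A minimal sketch, assuming the target table is named my_table (a placeholder):

select "table",
       diststyle,          -- e.g. AUTO(ALL), AUTO(EVEN), EVEN, KEY(...)
       size as size_mb,    -- total 1 MB blocks used across the cluster
       pct_used,
       tbl_rows
from svv_table_info
where "table" = 'my_table';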

It is useful to keep a key metric in mind: a single Redshift slice can easily load more than 5 MB/sec, which is roughly 18 GB/hour per slice (5 MB/sec x 3,600 sec), regardless of node type. The data size here is the RAW (uncompressed) size. For large node types with 16 slices, you should expect roughly 200 GB/hour per node at a minimum. If your COPY time becomes significantly longer than expected, it is a good indication that something bad (like a space problem) has occurred. Here is a useful query that you can use to help troubleshoot "DISK FULL" issues:

select '2000-01-01'::timestamp + (currenttime/1000000.0) * interval '1 second' as currenttime,
       node_num, query_id, temp_blocks
from pg_catalog.stl_disk_full_diag;
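To watch overall disk headroom per node while the COPY runs, a quick check against STV_PARTITIONS can also help (a sketch; used and capacity are in 1 MB blocks, and the view may require superuser access):

select owner as node,
       sum(used) as used_mb,
       sum(capacity) as capacity_mb,
       round(sum(used)::numeric / sum(capacity) * 100, 1) as pct_used
from stv_partitions
group by owner
order by owner;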

Recommendations:

  1. Change DISTSTYLE to EVEN or KEY (if a good DISTKEY can be found). Starting with EVEN is fine.

  2. Load a small set of files into the target table first. The objective is to get a better idea of the actual Snappy compression ratio and of load performance (a combined sketch of 1 and 2 follows this list).

  3. To speed things up, use a bigger cluster (Classic Resize). A quicker option is to Elastic Resize (ER) to 2x your cluster; when the load is done and validated, you can ER back.
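A minimal sketch of recommendations 1 and 2, assuming the target table is named my_table and using placeholder bucket and IAM role names:

-- Pin the distribution style up front instead of starting at AUTO(ALL);
-- a DISTKEY can still be chosen later once the data is better understood.
alter table my_table alter diststyle even;

-- Load a small subset first (a single S3 prefix or a few files) to gauge
-- the real Snappy-to-Redshift size ratio and the per-node load rate.
copy my_table
from 's3://my-bucket/parquet/sample-prefix/'
iam_role 'arn:aws:iam::111111111111:role/MyRedshiftCopyRole'
format as parquet;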

Two useful references:

  1. Loading data from Amazon S3: https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html

  2. How can I troubleshoot high or full disk usage with Amazon Redshift?: https://aws.amazon.com/premiumsupport/knowledge-center/redshift-high-disk-usage/

EXPERT
answered 3 years ago
