
How do I troubleshoot high or full disk usage with Amazon Redshift?


I experienced high or full disk utilization on Amazon Redshift and want to troubleshoot this issue.

Resolution

To resolve high or full disk utilization issues for Amazon Redshift, follow these troubleshooting steps.

Distribution and sort key

Review the table's distribution style, distribution key, and sort key selection. Tables with distribution skew can fill the disk on a single node. If you have tables with skewed distribution, then change the distribution style to a more uniform distribution. Distribution and row skew can affect storage skew and the size of intermediate row sets when a query runs.

To get the cardinality of your distribution key, run the following query:

SELECT <distkey column>, COUNT(*) 
 FROM <schema name>.<table with distribution skew> 
 GROUP BY <distkey column> 
 HAVING COUNT(*) > 1 
 ORDER BY 2 DESC;

Note: Replace distkey column, schema name, and table with distribution skew with your own values.

To avoid a sort step, use SORT KEY columns in your ORDER BY clause. A sort step might use excessive memory and cause a disk spill. For more information, see Sort keys.

In the filtered result set, choose a column with high cardinality to view its data distribution. For more information, see Choose the best distribution style.

Query processing

Review the memory that's allocated to your queries. While a query processes, intermediate results are stored in temporary blocks. If there isn't enough free memory, then the intermediate results spill to disk. Intermediate result sets aren't compressed, which reduces the available disk space. For more information, see Insufficient memory allocated to the query.

For temporary tables, Amazon Redshift defaults to a table structure with EVEN distribution and no column encoding. If you use SELECT...INTO syntax, then use a CREATE TABLE statement instead so that you can specify the distribution style and sort key. For more information, see Tip #6 in Top 10 performance tuning techniques for Amazon Redshift.
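For example, a CREATE TABLE ... AS statement lets you set the distribution key and sort key that SELECT...INTO omits. The table and column names in this sketch are illustrative:

```sql
-- Instead of: SELECT customer_id, order_date, amount INTO temp_sales FROM sales ...;
-- create the temporary table with an explicit distribution key and sort key
-- (temp_sales, sales, and the column names are example names):
CREATE TEMP TABLE temp_sales
  DISTKEY (customer_id)
  SORTKEY (order_date)
AS
SELECT customer_id, order_date, amount
FROM sales
WHERE order_date >= '2025-01-01';
```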

If insufficient memory is allocated to your query, then you might see a step in SVL_QUERY_SUMMARY where is_diskbased shows the value true. The following query identifies the top 20 disk spill queries for a specified time range:

SELECT q.userid, q.query, q.starttime, q.endtime, m.query_temp_blocks_to_disk, btrim(querytxt)  
 FROM stl_query q JOIN SVL_QUERY_METRICS_SUMMARY m ON m.query = q.query 
 WHERE m.query_temp_blocks_to_disk > 0 
   AND starttime BETWEEN '2025-01-01 00:00:00' AND '2025-01-02 00:00:00' 
 ORDER BY m.query_temp_blocks_to_disk DESC 
 LIMIT 20;

To resolve this issue, increase the number of query slots to allocate more memory to the query. 
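With manual WLM, one way to allocate more memory to a single session is to temporarily claim additional query slots. The slot count of 3 here is only an example value:

```sql
-- Claim 3 slots from the queue for this session (example value):
SET wlm_query_slot_count TO 3;

-- ...run the memory-intensive query here...

-- Return to the default of 1 slot when finished:
SET wlm_query_slot_count TO 1;
```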

If you notice a sudden spike in utilization, then run the following query to identify the top 20 disk spill queries:

SELECT a.userid, a.query, a.blocks_to_disk, trim(b.text) as text 
 FROM stv_query_metrics a, stv_inflight b 
 WHERE a.query = b.query 
   AND a.segment = -1 
   AND a.step_type = -1 
   AND a.max_blocks_to_disk > 0 
 ORDER BY 3 DESC
 LIMIT 20;

In the output, view the blocks_to_disk column value to identify disk spills. Terminate queries that spill too much data, and then allocate more memory to those queries before you run them again.

You can also use WLM query monitoring rules to counter heavy process loads and to identify I/O intensive queries.
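As a sketch, a query monitoring rule in the WLM JSON configuration might log queries that spill a large amount of data to disk. The rule name and threshold here are example values; query_temp_blocks_to_disk is the QMR metric for temporary blocks written to disk:

```json
[
  {
    "rule_name": "log_heavy_disk_spill",
    "predicate": [
      {
        "metric_name": "query_temp_blocks_to_disk",
        "operator": ">",
        "value": 100000
      }
    ],
    "action": "log"
  }
]
```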

Tables with VARCHAR(MAX) columns

Check VARCHAR or CHARACTER VARYING columns for trailing blanks. Trailing blanks might be omitted when data is stored on disk, but can occupy the full declared length in memory when queries process. The maximum length for a VARCHAR or CHARACTER VARYING column is 65535 bytes. It's a best practice to use the smallest possible column size.

To generate a list of tables with maximum column widths, run the following query:

SELECT database, schema || '.' || "table" AS "table", max_varchar 
 FROM svv_table_info 
 WHERE max_varchar > 150 
 ORDER BY 2;

Run the following query to identify and display the true widths of the wide VARCHAR table columns:

SELECT max(octet_length(rtrim(column_name))) FROM table_name;

In the output from this query, confirm that the length fits your use case. If the columns are at maximum length and exceed your needs, then adjust their length to the minimum size needed.
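Amazon Redshift allows ALTER TABLE ... ALTER COLUMN to increase a VARCHAR column's size, but not to decrease it. One way to shrink an oversized column is to add a smaller column, copy the data, and swap the names. The table and column names in this sketch are examples:

```sql
-- t1.comments is VARCHAR(65535), but the trimmed values fit in 200 bytes
-- (t1, comments, and comments_new are example names):
ALTER TABLE t1 ADD COLUMN comments_new VARCHAR(200);
UPDATE t1 SET comments_new = rtrim(comments);
ALTER TABLE t1 DROP COLUMN comments;
ALTER TABLE t1 RENAME COLUMN comments_new TO comments;
```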

For more information, see Amazon Redshift best practices for designing tables.

High column compression

To encode all columns except the sort key, use ANALYZE COMPRESSION or automatic table optimization. It's a best practice to use column encoding.
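For example, you can report recommended encodings for an existing table, or let Amazon Redshift manage the encodings automatically. The schema and table names here are placeholders:

```sql
-- Report suggested column encodings and estimated space savings:
ANALYZE COMPRESSION schema_name.table_name;

-- Or let automatic table optimization manage the encodings:
ALTER TABLE schema_name.table_name ALTER ENCODE AUTO;
```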

Maintenance operations

Be sure to regularly run ANALYZE and VACUUM on the tables in your Amazon Redshift database. Identify any queries that run against tables that are missing statistics. Then, prevent queries from running against tables that are missing statistics so that Amazon Redshift doesn't scan unnecessary table rows.

Note: Maintenance operations such as VACUUM and DEEP COPY use temporary storage space for their sort operations and might cause a spike in disk usage.
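For example, to reclaim deleted space, re-sort rows, and then refresh the planner statistics, you might run the following commands. The table name and sort threshold are example values:

```sql
-- Reclaim deleted space and re-sort until the table is at least 95% sorted:
VACUUM FULL schema_name.table_name TO 95 PERCENT;

-- Update the query planner statistics for the same table:
ANALYZE schema_name.table_name;
```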

For example, the following query identifies outdated stats in Amazon Redshift:

SELECT table_id, database, schema, "table", stats_off, size
 FROM svv_table_info 
 WHERE stats_off > 10 
 ORDER BY size DESC;

Additionally, use the ANALYZE command to view and analyze table statistics.

Cartesian products with cross-joins

Use the query's EXPLAIN plan to look for Cartesian products. Cartesian products result from cross-joins between unrelated tables and can greatly increase the number of blocks. If joined tables don't share a JOIN condition, then the join produces a Cartesian product of the two tables: every row of one table joins to every row of the other table. Cross-joins can result in higher memory utilization and more tables spilled to disk.

Cross-joins can also run as nested loop joins and cause long process times. Nested loop joins result in spikes in overall disk usage. For more information, see Identifying queries with nested loops.
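To check a query's plan for a Cartesian product, run EXPLAIN and look for a nested loop join operator. The tables in this sketch are examples, and the exact plan text varies by cluster and query:

```sql
-- A join with no JOIN condition produces a Cartesian product:
EXPLAIN
SELECT *
FROM customers c, orders o;
-- The plan for a query like this typically contains an "XN Nested Loop" step,
-- which indicates that every row of one table joins to every row of the other.
```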

Minimum table size

Individual tables might have different sizes in different clusters. The minimum table size is determined by the number of columns, whether the table has a SORTKEY, and the number of populated slices. If you recently resized an Amazon Redshift cluster, then you might see a change in your overall disk storage because the number of slices changed. Amazon Redshift also counts the table segments that each table uses. For more information, see Why does a table in an Amazon Redshift cluster consume more or less disk storage space than expected?

Tombstone blocks

Tombstone blocks are generated when a WRITE transaction to an Amazon Redshift table occurs while there's a concurrent read operation. Amazon Redshift keeps the blocks from before the write operation so that the concurrent read operation stays consistent. Amazon Redshift blocks can't be changed. Every INSERT, UPDATE, or DELETE action creates a new set of blocks and marks the old blocks as tombstoned.

Sometimes tombstones fail to clear at the commit stage because of long-running table transactions. Tombstones can also fail to clear when there are too many ETL loads that run at the same time. Because Amazon Redshift monitors the database from the time that the transaction starts, any table written to the database also retains the tombstone blocks. If long-running table transactions occur regularly and across several loads, enough tombstones can accumulate to result in a "Disk Full" error.

If long-running transactions remain active, then commit or end the transactions so that Amazon Redshift can release the tombstone blocks. For example, the following transaction explicitly commits its changes:

begin;
create table a (id int);
insert into a values(1);
commit;
drop table a;

Run the following query to confirm tombstone blocks:

SELECT trim(name) as tablename, count(case when tombstone > 0 then 1 else null end) as tombstones 
 FROM svv_diskusage 
 GROUP BY 1 
 HAVING count(case when tombstone > 0 then 1 else null end) > 0 
 ORDER BY 2 DESC;

Copy a large file

A COPY operation might return a "Disk Full" error even if there's enough storage available. This error occurs if the sort operation spills to disk and creates temporary blocks.

If you encounter a "Disk Full" error, then check the STL_DISK_FULL_DIAG system table to see which query ID caused the error and how many temporary blocks it created:

SELECT 
  '2000-01-01'::timestamp + (currenttime/1000000.0)* interval '1 second' as currenttime,
  node_num,
  query_id,
  temp_blocks 
 FROM stl_disk_full_diag;

For more information, see Amazon Redshift best practices for loading data.

Datashare query blocks on consumer clusters

When a datashare query runs on the consumer cluster, the data blocks that are associated with the query count toward the PercentageDiskSpaceUsed metric. These data blocks are removed from the metric after a cluster reboot and other events. This is expected behavior, and no further action is required.

Check disk space

Check the percentage of disk space under the Performance tab on the Amazon Redshift console.
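You can also query the STV_PARTITIONS system view to see raw disk usage for each node and disk partition:

```sql
-- Used blocks and total capacity (in 1 MB blocks) for each disk partition:
SELECT owner AS node, host, diskno, used, capacity,
       (used::float / capacity) * 100 AS pct_used
FROM stv_partitions
ORDER BY pct_used DESC;
```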

Related information

Amazon Redshift performance

AWS OFFICIAL · Updated 7 months ago