Skip to content

Redshift cluster uses only 6 of 8 slices after scaling from 2 to 4 nodes

0

Cluster info

• Node type & count: 4 nodes (ra3.xlplus) • Redshift version: 1.0.121010 (patch 193)

Problem summary

After scaling the cluster from 2 to 3 and later 4 nodes we observed that the new nodes use only one slice each, while the older nodes use two slices each. By default every node has 2 slices. As a result, there is unbalanced data distribution between nodes: the older nodes constantly run at 70–80 % CPU, while the new nodes stay around 30 %

What I have already tried

• Re-created large tables via UNLOAD … PARQUET → COPY both with DISTSTYLE EVEN and with the original DISTKEY • Tested different file sizes (30 MB and larger) • Checked distribution using system views:

SELECT node, slice FROM stv_slices ORDER BY node, slice; SELECT slice, COUNT(*) AS blocks FROM stv_blocklist WHERE tbl = (SELECT oid FROM pg_class WHERE relname = '<table>') GROUP BY slice;

• Data is redistributed, but I still see a maximum of 6 active slices instead of 8 (for a 4-node cluster)

Expected behaviour

With 4 nodes the cluster should have 8 active slices and use all of them after a deep copy so that CPU load is evenly distributed

Actual behaviour

• stv_slices shows 8 slices, but the affected tables use only 5–6 slices • CPU on the older nodes remains high even after data redistribution

Request

Please, suggest an advice so that the cluster is correctly utilizing the new nodes and slices. Maybe there is an issue with cluster configuration

asked 9 months ago191 views
1 Answer
1

This issue appears to be related to how Amazon Redshift distributes data across slices after scaling your cluster. When you scale a Redshift cluster, the data distribution doesn't automatically rebalance across all slices optimally.

Here's what's happening and what you can try:

  1. Data Distribution Skew: Even though you have 8 slices (2 per node) in your 4-node ra3.xlplus cluster, your data is not being evenly distributed across all slices. This is causing CPU imbalance with older nodes working harder than newer ones.

  2. VACUUM and ANALYZE: Run a full vacuum and analyze on your tables. This can help redistribute data and update statistics:

VACUUM FULL;
ANALYZE;
  1. Check Data Distribution: Run a diagnostic query to identify tables with data skew:
select trim(pgn.nspname) as schema, 
trim(a.name) as table, id as tableid, 
decode(pgc.reldiststyle,0, 'even',1,det.distkey ,8,'all') as distkey, 
dist_ratio.ratio::decimal(10,4) as skew, 
det.head_sort as "sortkey", 
det.n_sortkeys as "#sks", b.mbytes,  
decode(b.mbytes,0,0,((b.mbytes/part.total::decimal)*100)::decimal(5,2)) as pct_of_total, 
decode(det.max_enc,0,'n','y') as enc, a.rows, 
decode( det.n_sortkeys, 0, null, a.unsorted_rows ) as unsorted_rows , 
decode( det.n_sortkeys, 0, null, decode( a.rows,0,0, (a.unsorted_rows::decimal(32)/a.rows)*100) )::decimal(5,2) as pct_unsorted 
from (select db_id, id, name, sum(rows) as rows, 
sum(rows)-sum(sorted_rows) as unsorted_rows 
from stv_tbl_perm a 
group by db_id, id, name) as a 
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
left outer join (select tbl, count(*) as mbytes 
from stv_blocklist group by tbl) b on a.id=b.tbl
inner join (select attrelid, 
min(case attisdistkey when 't' then attname else null end) as "distkey",
min(case attsortkeyord when 1 then attname else null end ) as head_sort , 
max(attsortkeyord) as n_sortkeys, 
max(attencodingtype) as max_enc 
from pg_attribute group by 1) as det 
on det.attrelid = a.id
inner join ( select tbl, max(mbytes)::decimal(32)/min(mbytes) as ratio 
from (select tbl, trim(name) as name, slice, count(*) as mbytes
from svv_diskusage group by tbl, name, slice ) 
group by tbl, name ) as dist_ratio on a.id = dist_ratio.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where mbytes is not null 
order by mbytes desc;
  1. Deep Copy with DISTSTYLE AUTO: For tables showing significant skew, try recreating them with DISTSTYLE AUTO instead of EVEN or specific DISTKEY. This lets Redshift choose the optimal distribution style:
CREATE TABLE new_table DISTSTYLE AUTO AS SELECT * FROM original_table;
  1. Batch Processing for Large Tables: For very large tables, consider using a batched approach for the deep copy operations to improve performance.

  2. Check for Missing Statistics: Run this query to identify tables with missing statistics that might be affecting distribution:

select substring(trim(plannode),1,100) as plannode, count(*) 
from stl_explain 
where plannode like '%missing statistics%' 
group by plannode 
order by 2 desc;

If these steps don't resolve the issue, you may need to contact AWS Support as there could be an underlying configuration issue with your cluster after scaling.
Sources
Redshift diagnostics queries to identify the performance bottleneck | AWS re:Post
Issue distributing data among nodes | AWS re:Post
Data distribution for query optimization - Amazon Redshift
Improve COPY ingestion performance for large data loads on Amazon Redshift | AWS re:Post

answered 9 months ago
EXPERT
reviewed 9 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.