Skip to content

Issue distributing data among nodes

0

Hi! I'm currently facing an issue trying to equally distribute data among nodes, this is my current situation: current status I used this query to look for node0 MB vs the other 2 nodes to look for imbalances.

with space_used (schemaname, tablename, mbytes, avg_node_mbytes, node0_mbytes) as ( select trim(pgn.nspname), trim(tbl.name), sum(bl.mbytes), sum(bl.mbytes) / (select count(distinct node) from stv_slices), sum(case when bl.node = 0 then bl.mbytes else 0 end) from ( select id, name, sum(rows) as rows from STV_TBL_PERM group by db_id, id, name ) as tbl join PG_CLASS as pgc on pgc.oid = tbl.id join PG_NAMESPACE as pgn on pgn.oid = pgc.relnamespace join ( select bl.tbl, s.node, count(*) as mbytes from stv_blocklist bl join stv_slices s on s.slice = bl.slice group by bl.tbl, s.node ) bl on bl.tbl = tbl.id group by 1, 2 ) select * from space_used order by (node0_mbytes - avg_node_mbytes) desc limit 20;

This is the result: Enter image description here I changed several tables, for example rows 2, and 5 from the original distkey to diststyle EVEN, and there is still imbalance between nodes.

This imbalance has a huge impact on processor usage: Enter image description here

What I'm doing wrong? is there a way to know specifically witch tables are impacting on the overuse of node0, also, why EVEN is not "evenly" distributing the tables?

Thanks!

asked 2 years ago862 views
2 Answers
0

Try running VACUUM and ANALYZE. This can help re-sort the tables and update statistics, which might improve performance.

VACUUM FULL;
ANALYZE;
EXPERT
answered 2 years ago
EXPERT
reviewed 2 years ago
  • I am running daily analyze for tables with >5% stat_off or >5% unsorted, ran a vacuum FULL 100 percent on those tables that I switched to EVEN after changing the diststyle. No difference

0

Hello,

In Redshift data distribution happens across slices. A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.

The number of slices per node is determined by the node size of the cluster. For more information about the number of slices for each node size, refer the doc. To check the number of slices for each node, please run below query

select * from stv_slices;

Please see if the slices are equally distributed across all compute nodes.

When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and run queries efficiently. In case of even data will be evenly distributed across all slices. When you choose ALL based distribution, a copy of the entire table is distributed to every node.

Then run query from link to get data distribution across slices. Result will show table wise distribution across slice. Use this information to identify the table/s with uneven distribution.

AWS
SUPPORT ENGINEER
answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.