When choosing your DISTKEY, pick the key based on how frequently the column is joined on and the size of the joining rows. Also check the distribution profile to ensure values are uniformly distributed, that the column does not contain a significant number of NULL values, and that the level of data skew is low.
Designate both the dimension table's primary key and the fact table's corresponding foreign key as the DISTKEY. When choosing the column for your DISTKEY, prefer a column with many distinct values and high cardinality. You mentioned you have tried both DISTSTYLE EVEN and DISTKEY; have you tried DISTSTYLE AUTO, where Redshift automatically chooses the best distribution based on your data? For best practices on choosing the distribution style, see: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html
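As a rough sketch of the suggestions above (the `access_logs` table and its columns are hypothetical, chosen to match the access-log use case described later in this thread), you can let Redshift manage distribution with DISTSTYLE AUTO and then inspect the resulting skew via the SVV_TABLE_INFO system view:

```sql
-- Hypothetical access-log table; DISTSTYLE AUTO lets Redshift choose
-- and adjust the distribution style as the table grows.
CREATE TABLE access_logs (
    event_time TIMESTAMP,
    username   VARCHAR(64),
    url        VARCHAR(256)
)
DISTSTYLE AUTO;

-- Check the distribution style Redshift chose and the row skew.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'access_logs';
```

A `skew_rows` value close to 1 indicates the rows are spread evenly across slices; a large value suggests the chosen distribution column is skewed.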
Based on the explanation you gave above, having your sort key be a timestamp is good practice. I added additional reading below on best practices for choosing a sort key: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
The choice of your DISTKEY matters most for the joins you will be doing with this table, as it determines whether a redistribution or broadcast will be needed; see:
https://docs.aws.amazon.com/redshift/latest/dg/c_data_redistribution.html
Ideally your DISTKEY will match on each joined table so that no redistribution is needed. If you're just scanning the table and there are no joins, the DISTKEY only affects performance if you choose a key with skewed values.
For example, if you distribute on a column with only two distinct values, the data lands on only 2 nodes out of 4. Ideally you want the number of distinct values to be a multiple of the number of nodes (at larger distinct-value counts this matters less).
Your sort key matters most for your scans, as it allows Redshift to skip blocks that don't contain the values you want. Each block header records the range of values the block contains; if that range doesn't match your filter criteria, Redshift won't scan the block, saving I/O and time. If you will consistently filter on the same leading sort columns, choose a compound sort key; otherwise an interleaved sort key may be best.
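The compound-versus-interleaved choice above can be sketched as follows (table and column names are hypothetical; each statement shows only the sort-key clause that differs):

```sql
-- Compound sort key: block skipping works best when queries filter on
-- the leading column (here event_time), as in time-range scans.
CREATE TABLE access_logs_compound (
    event_time TIMESTAMP,
    username   VARCHAR(64)
)
COMPOUND SORTKEY (event_time, username);

-- Interleaved sort key: gives equal weight to each column, useful when
-- different queries filter on different columns independently.
CREATE TABLE access_logs_interleaved (
    event_time TIMESTAMP,
    username   VARCHAR(64)
)
INTERLEAVED SORTKEY (event_time, username);
```

Note that interleaved sort keys carry higher VACUUM REINDEX maintenance cost, so compound is usually the safer default when one column dominates your filters.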
I have created a compound SORT KEY on time and username since this is access-log data. Filtering alone is fast, but since I also have to use GROUP BY and ORDER BY, it takes more than 120 seconds for only 30 million records; I don't know what will happen with 225 million. I have set DISTSTYLE EVEN, the skew percentage is 2.16%, and I always filter on the time field, which is in the SORT key. Even for only 3 months of data it runs very slowly, which I did not expect from a data warehouse.
No join is required on this table, but GROUP BY and ORDER BY are mandatory. In the compound SORT key I have provided time and username, but if the user asks for descending order then I have to sort in descending order, which takes about 120 seconds, and I am testing this query on only 30 million records, not 225 million. I have tried even distribution as well, but it still takes too long. Selection alone is fast, but with GROUP BY and ORDER BY the fetch time increases drastically on only 30 million rows. Previously, in Elasticsearch, the same query was very fast.
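For a query of the shape described above (this is a guessed reconstruction; the actual table, columns, and date range are not given in the thread), running EXPLAIN shows whether the time-range filter on the sort key is being applied during the scan, before the aggregate and sort steps:

```sql
-- Hypothetical version of the slow query: time-range filter on the
-- leading sort-key column, then GROUP BY and a descending ORDER BY.
EXPLAIN
SELECT username,
       COUNT(*) AS hits
FROM access_logs
WHERE event_time >= '2024-01-01'
  AND event_time <  '2024-04-01'
GROUP BY username
ORDER BY hits DESC;
```

In the plan, a low-cost sequential scan with the `event_time` filter attached indicates block skipping is working; the remaining time is then spent in the HashAggregate and Sort steps, which the sort key cannot help with because the ORDER BY is on a computed aggregate rather than a sorted column.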