How to select DIST key in Redshift for fast searching and aggregation


I have 225+ million rows in my Redshift table. The data is door-access activity logs: each row records when a user entered or exited through a door after scanning their badge.

Now I want to select the best DIST key to distribute this data across the nodes so that processing and aggregation are faster. The main fields I have are: id (unique), user_name, door_name, company_name, time (timestamp, YYYY-MM-DD HH:mm:ss). I want to fetch a user's data for a particular day, and I have already prepared the query. Can you please let me know which of these fields is the best DIST key, or do we need to create a custom DIST key?

I have provided the sort key on the time field, as this is a mandatory filter. To get per-day user data, aggregation is required: break the timestamp down to the day, then group by the truncated time and user_uuid. company_name is also a mandatory filter. So what is the best DIST key to provide?
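For reference, the per-day aggregation described above might look like the following sketch (the table name `access_logs` and the filter values are assumed; the question mentions both user_name and user_uuid, so substitute whichever column your table actually has):

```sql
-- Sketch of the per-day aggregation described in the question
-- (table name and filter values are illustrative assumptions)
SELECT DATE_TRUNC('day', time) AS day,
       user_name,
       COUNT(*)                AS events
FROM access_logs
WHERE company_name = 'Acme'
  AND time >= '2023-01-01'
  AND time <  '2023-04-01'
GROUP BY 1, 2
ORDER BY day, user_name;
```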

Since we can select only one DIST key per table, this is the most important step. Also, whenever we use the DIST key in a filter, ORDER BY and LIMIT do not work as expected; is there any alternative to this?

Thanks

Vishal
asked 10 months ago · 759 views
3 Answers

When choosing your DIST key, pick the column based on how frequently it is joined and the size of the joining rows. Also check the distribution profile to ensure values are uniformly distributed, that the column does not have a significant number of NULL values, and that the level of data skew is low.

Designate both the dimension table's primary key and the fact table's corresponding foreign key as the DISTKEY. When choosing the column for your DIST key, prefer one with many distinct values and high cardinality. You mentioned you have tried both DISTSTYLE EVEN and DISTKEY; have you tried AUTO, where Redshift automatically chooses the best distribution based on your data? Best practices for choosing the distribution style: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html
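A sketch of the co-located join pattern described above (table and column names are illustrative, not taken from the question):

```sql
-- Dimension and fact tables share the same DISTKEY, so rows that join
-- to each other land on the same node and no redistribution is needed.
CREATE TABLE users (
    user_id   BIGINT,
    user_name VARCHAR(100)
) DISTKEY (user_id);

CREATE TABLE access_logs (
    id           BIGINT,
    user_id      BIGINT,        -- foreign key to users
    door_name    VARCHAR(100),
    company_name VARCHAR(100),
    time         TIMESTAMP
) DISTKEY (user_id) SORTKEY (time);
```

With `DISTSTYLE AUTO` instead, Redshift starts small tables as ALL and shifts distribution as the table grows, which can be a reasonable default when the join pattern is not yet clear.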

Based on the explanation you gave above, having your sort key on the timestamp is good practice. Here is additional reading on best practices for choosing a sort key: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html

AWS
answered 7 months ago

The choice of your DIST key matters most for the joins you will be doing with this table, as it determines whether a redistribution or broadcast will be needed; see:

https://docs.aws.amazon.com/redshift/latest/dg/c_data_redistribution.html

Ideally, the DIST key on joined tables will match, so zero redistribution is needed. If you're just scanning the table and there are no joins, the DIST key will only affect performance if you choose a key with skewed values.

For example, if you distribute on a column with only two distinct values on a 4-node cluster, the data lands on only 2 of the 4 nodes. Ideally you want a distinct-value count that is a multiple of the number of nodes (at larger distinct-value counts this matters less).
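To check how evenly a table is distributed, Redshift exposes the SVV_TABLE_INFO system view; its `skew_rows` column is the ratio of rows on the fullest slice to rows on the emptiest slice (1.0 means perfectly even). The table name below is an assumption:

```sql
-- Inspect distribution style and row skew for one table
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'access_logs';
```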

answered 10 months ago
  • No join is required on this table, but GROUP BY and ORDER BY are mandatory. In the compound SORT key I have provided time and username, but if a user asks for descending order I have to sort descending, which takes nearly 120 seconds, and I am testing this query on only 30 million records, not 225 million. I have also tried EVEN distribution, but it still takes a long time. Plain selection is fast, but with GROUP BY and ORDER BY the fetch time changes drastically on only 30 million rows; previously the same query was very fast in Elasticsearch.


Your sort key is most important for your scans, as it allows Redshift to skip blocks that don't contain the values you want. Each block contains a header with the range of values it holds; if that range doesn't match your filter criteria, Redshift won't scan that block, saving I/O and time. If you will commonly filter on multiple sort columns in the same order, choose a compound sort key and filter down from the leading column. Otherwise, interleaved may be the best choice.
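The two options above can be sketched as follows (column list assumed from the question; the choice depends on whether your filters always lead with `time`):

```sql
-- Compound: block pruning works best when filters lead with the
-- first sort column (time), then optionally narrow by user_name.
CREATE TABLE access_logs (
    id           BIGINT,
    user_name    VARCHAR(100),
    door_name    VARCHAR(100),
    company_name VARCHAR(100),
    time         TIMESTAMP
) COMPOUND SORTKEY (time, user_name);

-- Interleaved: gives equal pruning weight to each sort column,
-- useful when queries filter on user_name alone as often as on time.
-- (Same column list; shown as an ALTER-free alternative definition.)
-- ) INTERLEAVED SORTKEY (time, user_name);
```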

answered 10 months ago
  • I have created a compound SORT key with time and username, since this is access-log data. Filtering alone is fast, but with GROUP BY and ORDER BY it takes more than 120 seconds for only 30 million records; I don't know what will happen with 225 million. I have set DISTSTYLE EVEN, the skew percentage is 2.16%, and I always filter on the time field, which is in the SORT key. Even for only 3 months of data it runs very slowly, which I did not expect from a data warehouse.
