Some Redshift questions

0

Hi @Everyone,

Could someone enlighten me?

  1. What is the difference between a clustered index and a SORTKEY?
  2. In redshift the concept of index has been substituted by SORTKEY, right?
  3. Can I partition tables like in transactional DB in RedShift?
  4. What consideration I should take when it comes to selecting between a SORTKEY or DISTKEY?

Sorry for asking several questions at the same time.

Highly appreciate the clarifications

Amy
asked a year ago306 views
1 Answer
2
Accepted Answer

Hi Amy, Thank you for reaching out. Please find the answers to your questions below.

  1. What is the difference between a clustered index and a SORTKEY?

Answer: Redshift does not support clustered indexes. I understand you are trying to compare the SORT KEY with a CLUSTERED INDEX in some other relational databases. To understand the difference, you will need to understand the columnar nature of data storage in Redshift. This page in the Redshift documentation will help you understand that.

In a database system that stores data in rows, the clustered index would physically order the row-level data according to the indexed column. In Redshift, the data on disk is ordered according to the sort key for that column.

  1. In Redshift the concept of index has been substituted by SORTKEY, right?

Answer: Sort keys in Redshift provide similar benefits a clustered indexes, that is, order data at storage level to reduce table scans and serve better performing queries.

  1. Can I partition tables like in transactional DB in Redshift?

Answer: You do not partition tables in Redshift. It stores columnar data in 1 MB disk blocks ordered by the SORT KEY. The min and max values for each block are stored as part of the metadata. If a query uses a data ranges (say last year or last month or between customer dates), the query processor can use the min and max values in the metadata to rapidly skip over large numbers of blocks during table scans.

For example, suppose that a table stores five years of data sorted by date and a query specifies a date range of one month. In this case, you can eliminate up to 98 percent of the disk blocks from the scan.

When you query EXTERNAL TABLES (for data stored in Amazon S3) in Redshift using Redshift Spectrum queries, you can restrict the amount of data that Redshift Spectrum scans by filtering on the partition key. You can partition your data by any key on EXTERNAL TABLES. Refer to this documentation.

  1. What consideration I should take when it comes to selecting between a SORTKEY or DISTKEY?

Answer: Firstly, considering using automatic table optimization, so you don't need to choose the sort key or distribution key of your table.

If you want to select them manually, you would have ti understand the patterns of the queries hitting the table. For instance, if recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. Refer to this page.

For choosing the right DISTRIBUTION STYLE, refer to the guidelines provided in this page.

Again, working with automatic table optimization is recommended so you do not need to worry about choosing the right SORT KEY or DISTRIBUTION KEY.

If you have further questions, please feel free to ask.

AWS
bhaums
answered a year ago
  • Thank you very much for your answers @bhaums. Really appreciate. Have a nice day!

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions