- Newest
- Most votes
- Most comments
Hello Rafael,
I would say that you should never ever implement option 2 on Redshift.
Redshift is a columnar database and have no row indexes like OLTP DBs you might be used to. That means that it's hard for Redshift to find specific rows. On the other hand it stores data separately for each column, which makes it easy to access columns independently.
To speed up queries you can order your data with SORT KEYs. If you have no sort key in your table then you will be most of the time doing a full scan. If your query include JOINs you must also take in account DIST KEYs, which will tell Redshift how to distribute data across cluster's nodes and slices.
It's also important to choose the right DATA TYPE and COMPRESSION ENCODING for each column to achieve maximum performance.
Is this what you shared your real data ? Or there are more columns ? I could help you more if I had a full overview of your data schema.
Hope this gives you a start, feel free to comment if you have any doubts.
Relevant content
- asked 2 years ago
- Accepted Answerasked 4 years ago
- asked 2 years ago
- AWS OFFICIALUpdated 2 years ago
- AWS OFFICIALUpdated 3 months ago
- AWS OFFICIALUpdated 2 years ago
Thanks for your idea. It's make sense to use the option 1 which Redshift is columar database.
But, the data scheme just is an example, and in actual, there're more measure column than that, means that the table in option 1 will many more columns and also, the table in option 2 with have many value in 'balance_type' column. You can imagine that, in a banking business domain, there will be many types of balances of account (closing balance, in-due balance, past-due balance, interest balance, fee amount, ...), which can be added during the implementation process and future.
I concern if using option 2, and set 'SORTKEY' for table in 'balance_type' column, and almose query behaviour using EDW data must filter in certain values in balance_type column. If so, could it be equivalent to option 1? Thanks.
I see that you are trying to overcome the challenges of schema evolution on a relational database by having this approach. If you do what you are proposing you lose type enforcement for each column, which means that if you have different data types you would need to choose the most "permissive" what means that if you have a string this column will have to be VARCHAR no matter what and every operation will have to cast data types based on measure_type to achieve desired output. Not to mention that you would lose optimal compression for each type too.
Thanks for your comment. But in this case, all columns have same data type numeric(23,2) which represents account balance. I understand that, in case of different data types, the design should be kept separate columns. Please help me understand in case the columns have the same data type, and I use sortkey to improve performance.