Question on query performance


Hi All, in one of our Redshift databases most of the columns are defined with a length of 500 bytes even though they hold actual data of 50 bytes or less. Our understanding was that, since Redshift compresses the columns, the physical disk storage would automatically get rid of those blank ~450 bytes (500 bytes - 50 bytes) and only consume the compressed version of the ~50 bytes of actual data, so defining a larger length for a column would not harm us in any way.
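For what it's worth, the on-disk half of that assumption is easy to check: the encodings actually applied per column are visible in PG_TABLE_DEF, and ANALYZE COMPRESSION estimates how much space a re-encode would save. A rough sketch (schema/table names are placeholders, and PG_TABLE_DEF only shows schemas in the current search_path):

```sql
-- Encodings actually applied on disk for each column (placeholder names).
SELECT "column", type, encoding
FROM pg_table_def
WHERE schemaname = 'public'
  AND tablename  = 'my_table';

-- Estimated space saving if the table were re-encoded.
ANALYZE COMPRESSION public.my_table;
```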

However, after reading the doc below, that assumption appears to be simply wrong. It says that during query execution Redshift parks intermediate results in temporary tables in each compute node's memory, and that this data is stored uncompressed, so it consumes the full defined data type size rather than the actual or compressed data length. https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-smallest-column-size.html

I want to understand from the experts here:

1. Is this a real issue, and is there any way, or any ready-made data dictionary/system table/view, to find all such tables/columns that we need to fix to avoid the performance overhead?

2. We also ran some sample queries to test the above behaviour and verified the results from SVL_QUERY_REPORT and SVL_QUERY_SUMMARY; they show a 75% degradation in performance. I am not sure whether that is the correct way of checking (a rough sketch of what we ran is below), or whether there is a better way to verify the performance (or DB resource usage such as CPU, I/O, and memory consumption) of a query execution in Redshift. Please guide me here.
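For context, the comparison we ran was along these lines; the query IDs are placeholders, and the columns shown are the ones documented for these views, so please correct me if this is not the right way to read them:

```sql
-- Per-step detail for one run: elapsed time, rows, bytes, workmem, and whether
-- the step spilled to disk. Replace 123456 with the real query id from STL_QUERY.
SELECT query, seg, step, label, maxtime, avgtime, rows, bytes, workmem, is_diskbased
FROM svl_query_summary
WHERE query = 123456
ORDER BY seg, step;

-- Query-level resource usage (CPU time, blocks read, temp blocks spilled to disk).
SELECT query, query_cpu_time, query_blocks_read,
       query_temp_blocks_to_disk, query_execution_time
FROM svl_query_metrics_summary
WHERE query IN (123456, 123457);   -- 50-byte run vs 500-byte run
```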

asked a year ago · 240 views
2 Answers

Redshift automatic workload management (Auto WLM) allocates memory for each query based on the column sizes. If the actual data is much smaller than the column sizes, then memory is over-allocated, and this impacts your throughput because fewer queries can be executed concurrently.
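If you want to see what was reserved for a given query versus how long it queued and executed, one place to look (assuming the est_peak_mem column behaves as documented) is STL_WLM_QUERY; the query id below is a placeholder:

```sql
-- Estimated peak memory reserved for the query, plus queue and execution time.
SELECT query, service_class, slot_count,
       est_peak_mem, total_queue_time, total_exec_time
FROM stl_wlm_query
WHERE query = 123456;
```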

You might see the over-allocation of memory in the workmem column of SVL_QUERY_REPORT for the scan step, but that is not an easy or direct correlation that can confirm the column-sizing issue you are facing. You should start with the largest-width column and work your way down the list, comparing the defined width against the actual data contained in each column; a starting-point query is sketched below.
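As a rough starting point for that list, something like the following against SVV_COLUMNS returns the widest declared character columns, and a per-column check then compares the declared width with what is actually stored. Table, schema, and column names below are placeholders, and the 256-byte threshold is only illustrative:

```sql
-- Declared character columns, widest first (threshold is illustrative).
SELECT table_schema, table_name, column_name, data_type, character_maximum_length
FROM svv_columns
WHERE character_maximum_length IS NOT NULL
  AND character_maximum_length >= 256
  AND table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY character_maximum_length DESC;

-- For a candidate column, how many bytes are actually stored at most?
SELECT MAX(OCTET_LENGTH(my_column)) AS max_actual_bytes
FROM my_schema.my_table;
```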

AWS
answered a year ago

Thank you so much, Milind, for the clarification. If I understand correctly, it means that if we allocate an unnecessarily large data length to the attributes, then fewer queries that use those tables can execute concurrently at any point in time, but individual query performance should not degrade.

The odd thing our team is stating, though, is that if we have a table whose column holds actual data of 50 bytes, and we run a query against a version of the table defined with that same length (50 bytes) versus the same table structure with a bigger defined length (500 bytes), then performance degrades by 75% when accessing the table with the bigger defined length, which does not sound correct.

I want to test the performance of sample read/write queries in the above scenario with a large volume of sample data (say 50-100 million rows) to prove the theory. Is there a quick way of creating such large tables in Redshift without using COPY from S3, something like procedural code with INSERT statements, i.e. a bulk insert within a loop? Is that possible? (A sketch of what I have in mind is below.)
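For anyone following along, something like the sketch below is what I have in mind: no COPY, just a tiny inline digit table cross-joined with itself to reach ~100 million rows. All table/column names and the generated values are placeholders:

```sql
-- Two copies of the same table; the only difference is the declared VARCHAR size.
CREATE TABLE t_narrow (id BIGINT, val VARCHAR(50));
CREATE TABLE t_wide   (id BIGINT, val VARCHAR(500));

-- Tiny seed table created inline (no COPY needed).
CREATE TABLE digits (d INT);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- Cross-join the seed table 8 times: 10^8 = 100 million rows,
-- each row's actual data well under 50 bytes.
INSERT INTO t_narrow
SELECT (d1.d + d2.d*10 + d3.d*100 + d4.d*1000 + d5.d*10000 +
        d6.d*100000 + d7.d*1000000 + d8.d*10000000)::BIGINT AS id,
       'val_' || (RANDOM() * 100000)::INT::VARCHAR AS val
FROM digits d1, digits d2, digits d3, digits d4,
     digits d5, digits d6, digits d7, digits d8;

-- Identical data loaded into the wide-column version.
INSERT INTO t_wide
SELECT id, val FROM t_narrow;

ANALYZE t_narrow;
ANALYZE t_wide;
```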

answered a year ago
