Spectrum performance caching and performance

0

Hello,

  1. Does spectrum use any kind of caching, when I run a spectrum query and check in svl_query_metrics_summary table, spectrum_scan_row_count and spectrum_scan_size_mb have null values. Would this indicate that some kind of caching has happened?
  2. As per the performance best practices the preferred storage format is parquet, and the reasoning is parquet reads only projected columns. But I see similar performance with ORC files as well and could not find how much data is being read when I look into query metrics table as mentioned above. Could someone please elaborate on what storage method is preferred and why in details?
  3. I have 2 dc2.8xlarge nodes, and when I see the query report table I see 32 (total slices in the cluster) operations for reading from s3 and other operations. Does this mean 32 spectrum nodes are in action for my cluster size? Does the parallelism or the number of spectrum nodes increase if I add more nodes, and this will give more performance boost?
nmakb
asked 5 years ago839 views
2 Answers
0
Accepted Answer

Hi nmakb,

Regarding Spectrum Caching:
Depending on your region and maintenance track you may already have the feature Spectrum caching that is in currently being deployed or you may be getting in soon. In the most recent release announcement (Amazon Redshift Maintenance (February 20th - March 21st 2019) you'll find a feature described as:

• Redshift Spectrum: Spectrum Request Accelerator is automatically and transparently enabled, significantly improving the performance of queries against data in Amazon S3.

This is the Spectrum caching feature where Spectrum sub-query results get cached in S3. I gave a talk on this feature at re:Invent 2018 that included the speedup seen from this feature in preview. You can see that here: https://www.slideshare.net/AmazonWebServices/extending-analytics-beyond-the-data-warehouse-ft-warner-bros-analytics-ant301-aws-reinvent-2018?qid=f102d8de-b377-4e5e-b6e1-e0d61fbc0316&v=&b=&from_search=1.

Regarding external storage format in S3 Parquet .vs. ORC:
WRT, storage format specific I/O avoidance and data reduction you'll find that Parquet and ORC are basically the same, except for some very specific low level features that only some SerDe implementations leverage. Having used most the Big Data tools, I can say that the differentiation between Parquet and ORC storage formats is not relevant until you pair a tool with the format. It's really an artifact of the Cloudera .vs. Hortonworks camps, which with their recent merger will soon melt away like snow in springtime. If you look back thru the Spectrum release history you'll see that for Redshift Parquet came from the very beginning of Spectrum and long before ORC. Hence, Redshift the SerDe for Parquet is more mature than the ORC SerDe. In every way that I have looked Parquet has a measurable performance benefit over ORC in Redshift. On the other hand, even in the AWS Big Data tools you can find exactly the opposite where with EMR Presto for example the opposite us generally true because Facebook, the originators of Presto, predominately use ORC, so ORC came first in Presto.

Regarding Spectrum layer parallelism:
Event though it's not explictly documented in the online Redshift doc. (at least I can't find it) many public Redshift talks have described the Spectrum layer parallelism as up to 10x Spectrum nodes per compute node slice. So, not only is Spectrum an order of magnitude more parallel than slices it scales up as the cluster's slices scale up. I have personally seen near linear scalability of the Spectrum layer across Redshift cluster of various number of nodes and instance types, meaning different number of slices.

I hope this answers your questions.
-Kurt

klarson
answered 5 years ago
0
  1. Does spectrum use any kind of caching, when I run a spectrum query and check in svl_query_metrics_summary table, spectrum_scan_row_count and spectrum_scan_size_mb have null values. Would this indicate that some kind of caching has happened?

I don't know. It might be cached by the normal caching mechanisms, or it might be that the data volume is expected to be large so caching is dispensed with.

  1. As per the performance best practices the preferred storage format is parquet, and the reasoning is parquet reads only projected columns. But I see similar performance with ORC files as well and could not find how much data is being read when I look into query metrics table as mentioned above. Could someone please elaborate on what storage method is preferred and why in details?

Parquet and ORC are both sorted column-store formats. Barring implementation issues, I would expect them to have much the same performance. Note I have never used either with Spectrum, so there may well be in practise particular issues with one or the other format.

I had the impression Parquet generally had better impementation in Redshift, which is why it was preferred. (I recall for example nested data support originally being available for Parquet but not ORC).

  1. I have 2 dc2.8xlarge nodes, and when I see the query report table I see 32 (total slices in the cluster) operations for reading from s3 and other operations. Does this mean 32 spectrum nodes are in action for my cluster size? Does the parallelism or the number of spectrum nodes increase if I add more nodes, and this will give more performance boost?

Spectrum uses up to 10 Spectrum worker nodes per slice in your cluster. The number used is controlled by Spectrum, not you, and depends on what it thinks of your query and whatever other factors it takes into account.

In your particular case, I'm not sure what you mean by "query report". Perhaps that report is only showing Redshift worker slices, and not Spectrum nodes, or perhaps for your query, only one Spectrum worker came into play, for each Redshift slice.

Toebs2
answered 5 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions