How is data returned from Spectrum to the Redshift cluster?


Seeking clarification on how the results of a Spectrum query are returned to the main Redshift cluster. Presumably this would impact the sizing of the nodes in the Redshift cluster to ensure they have sufficient capacity to process the results. Namely, are all results returned to the leader node, or is there some logic that maps data into the appropriate slice? If the latter, how does the association of the data to the slice work?

AWS
Jay_M
asked 4 years ago, 419 views
1 Answer
Accepted Answer

More detail is in my blog post: https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/. The Spectrum layer does the work it is able to do on its own (push-down operations). Ideally it filters out the majority of the rows read from S3 and sends only a small portion back to the main Redshift cluster for further processing (such as joins or DISTINCT). No, a tiny 2 x dc2.large cluster would not be able to handle 1M 1 GB Parquet files in S3 and do joins on these large external tables. Each slice of the main Redshift cluster can invoke up to a maximum of 10 Spectrum nodes per query.

Data that remains after Spectrum filtering is sent to Redshift slices depending on the next step in the execution pipeline (as generated by the Redshift optimizer) and on the hash values of the join/GROUP BY columns, etc. This is not much different from performing a join between a regular Redshift table that uses DISTSTYLE EVEN and another Redshift table that uses DISTKEY distribution.
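To make the slice mapping more concrete, here is a minimal conceptual sketch in Python of hash-based redistribution: each row returned after filtering is routed to a slice based on a hash of its join key, so rows with equal keys land on the same slice. This is only an illustration. The function names (`slice_for_key`, `redistribute`, `max_spectrum_nodes`) are hypothetical, and MD5 is a stand-in, since Redshift's actual internal hash function and routing logic are not described in this thread. The only figure taken from the answer is the limit of 10 Spectrum nodes per slice per query.

```python
import hashlib
from collections import defaultdict

# Figure quoted in the answer: each slice can invoke up to 10 Spectrum nodes per query.
MAX_SPECTRUM_NODES_PER_SLICE = 10


def max_spectrum_nodes(num_slices: int) -> int:
    """Upper bound on Spectrum nodes one query can use, given the slice count."""
    return num_slices * MAX_SPECTRUM_NODES_PER_SLICE


def slice_for_key(key, num_slices: int) -> int:
    """Map a join/GROUP BY key to a slice index.

    ASSUMPTION: MD5 is a stand-in hash for illustration only; it is not
    Redshift's real distribution hash.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def redistribute(rows, key_column: str, num_slices: int) -> dict:
    """Group rows by the slice that owns each row's key value."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for_key(row[key_column], num_slices)].append(row)
    return dict(slices)


if __name__ == "__main__":
    # Hypothetical rows that survived Spectrum-side filtering.
    filtered_rows = [
        {"customer_id": 101, "amount": 20},
        {"customer_id": 202, "amount": 35},
        {"customer_id": 101, "amount": 5},
    ]
    placement = redistribute(filtered_rows, "customer_id", num_slices=4)
    for slice_id, rows in sorted(placement.items()):
        print(slice_id, rows)
    print("Spectrum node ceiling for 4 slices:", max_spectrum_nodes(4))
```

Because both rows with `customer_id` 101 hash to the same slice, a join or aggregation on that column can proceed locally on each slice, which is the same idea as joining against a table distributed with DISTKEY.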

Expert
answered 4 years ago
