EMR customer question


A customer is using PySpark on EMR to do some calculations.

The results are saved to S3, which triggers an SQS message that in turn triggers a COPY command into Redshift.

So far - all good.

They are looking for a solution where, after Redshift is loaded with the new data, they can run some queries over the specific items that were just ingested.

They thought they might fire an SQS message for each item from the EMR job itself after it finishes computing (which sounds like a tightly coupled solution, and they are not sure how robust it would be).

Some more clarification: after the calculations are done and the results are inserted into Redshift, they also need to push the items to DynamoDB or another NoSQL store for quick retrieval.
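The DynamoDB push step could look something like the sketch below. This is only an illustration, not the customer's actual code: the table name `computed-results` and the item shape are assumptions. One real gotcha it handles is that boto3's DynamoDB resource API rejects Python floats, so numeric values from the Spark job must be converted to `Decimal` first.

```python
from decimal import Decimal

def to_dynamodb_item(item):
    """Convert a computed-result dict into a DynamoDB-safe item.
    boto3's Table resource rejects Python floats, so numeric values
    go through Decimal(str(...)) to avoid precision surprises."""
    safe = {}
    for key, value in item.items():
        if isinstance(value, float):
            safe[key] = Decimal(str(value))
        else:
            safe[key] = value
    return safe

def put_items(table, items):
    """Write items with boto3's automatic batch writer, which buffers
    puts into BatchWriteItem calls and retries unprocessed items.
    `table` would be a boto3 Table resource, e.g.
    boto3.resource("dynamodb").Table("computed-results")."""
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=to_dynamodb_item(item))
```

Using `batch_writer()` keeps this decoupled from batch-size limits, but it does not remove the coupling concern raised above: the EMR job would still need DynamoDB permissions and error handling of its own.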

Does this sound legitimate?

Asked 7 years ago · 209 views
1 Answer
Accepted Answer

You might take a look at this older blog post and modify it for your use case. When the COPY command loads the data into Redshift, the loader writes an entry to DynamoDB marking the file as processed. You could connect a Lambda function to that DynamoDB table's stream, which would let you trigger the queries you want to run once the data is loaded.

https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
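A minimal sketch of that stream-triggered Lambda might look like the following. The attribute name `s3_key`, the Redshift table `results`, and its `source_file` column are all placeholders, not names from the blog post; how the query is actually executed (Redshift Data API, psycopg2, etc.) is left as a stub.

```python
def build_queries(event):
    """Turn DynamoDB Streams INSERT records into follow-up queries
    over the rows just loaded from each S3 file. Table and attribute
    names here are hypothetical."""
    queries = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # react only to newly marked files, not updates
        new_image = record["dynamodb"]["NewImage"]
        s3_key = new_image["s3_key"]["S"]  # assumed attribute name
        queries.append(
            "SELECT * FROM results WHERE source_file = '%s'" % s3_key
        )
    return queries

def run_query(sql):
    # Placeholder: a real handler would submit this to Redshift,
    # e.g. via the Redshift Data API or a psycopg2 connection.
    print(sql)

def handler(event, context):
    """Lambda entry point attached to the DynamoDB stream that the
    Redshift loader writes its 'processed' entries to."""
    for query in build_queries(event):
        run_query(query)
```

Because the trigger fires off the "processed" marker rather than off the EMR job, this keeps the query step decoupled from the Spark code, which addresses the robustness concern in the question.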

AWS
Answered 7 years ago
