EMR customer question


A customer is using PySpark on EMR to do some calculations.

The results of the calculations are saved to S3, which triggers an SQS message that in turn triggers a COPY command into Redshift.

So far - all good.

They are looking for a way to run some queries over the specific items that were ingested, once Redshift has been loaded with the new data.

They thought maybe they could have EMR itself fire an SQS message for each item after it finishes computing (this sounds like a tightly coupled solution, and I'm not sure how robust it is).

Some more clarification: after the post-load calculations and the insert into Redshift, they need to push the data to DynamoDB or another NoSQL store for quick retrieval.
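To make the last step concrete, here is a minimal sketch of the Redshift-to-DynamoDB push, assuming each computed row is an (item_id, score) pair; the table name, key schema, and row shape are illustrative assumptions, not anything from the customer's actual pipeline:

```python
# Hypothetical sketch: after the Redshift load, copy the freshly ingested
# rows into DynamoDB for fast key-value retrieval. The attribute names and
# row shape are assumptions for illustration.

def rows_to_items(rows):
    """Convert Redshift result rows (item_id, score) into low-level
    DynamoDB items. DynamoDB's wire format carries numbers as strings,
    so numeric attributes are serialized accordingly."""
    return [
        {"item_id": {"S": str(item_id)}, "score": {"N": str(score)}}
        for item_id, score in rows
    ]

# In a real deployment this would be followed by something like:
#   import boto3
#   dynamodb = boto3.client("dynamodb")
#   for item in rows_to_items(rows):
#       dynamodb.put_item(TableName="computed-items", Item=item)

items = rows_to_items([("a1", 0.92), ("b2", 0.15)])
```

The conversion is kept as a pure function so it can be unit-tested without touching AWS; only the final `put_item` calls need credentials.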

Does this sound legitimate?

Asked 7 years ago · 209 views
1 answer
Accepted answer

You might take a look at this older blog post and adapt it to your use case. When the COPY command loads the data into Redshift, it writes an entry to DynamoDB marking the file as processed. You could connect a Lambda function to the DynamoDB stream, which would let you trigger the queries you want to run after the data is loaded.

https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
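The Lambda side of that pattern could look roughly like the sketch below: the function is subscribed to the DynamoDB table's stream, picks out entries marked processed, and derives the follow-up queries to run against Redshift. The attribute names (`s3_key`, `status`), the table/column names in the SQL, and the use of the Redshift Data API are all assumptions for illustration:

```python
# Hypothetical sketch of a Lambda subscribed to the DynamoDB stream used by
# the Redshift loader. Attribute names ("s3_key", "status") and the SQL are
# assumptions, not part of the blog post's actual schema.

def processed_keys(event):
    """Return the S3 keys of newly processed loads from a stream event."""
    keys = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        image = record.get("dynamodb", {}).get("NewImage", {})
        if image.get("status", {}).get("S") == "processed":
            keys.append(image["s3_key"]["S"])
    return keys

def handler(event, context):
    statements = []
    for key in processed_keys(event):
        statements.append(
            f"SELECT count(*) FROM facts WHERE source_file = '{key}'"
        )
        # A real deployment would submit each statement to the cluster,
        # e.g. via the Redshift Data API:
        #   boto3.client("redshift-data").execute_statement(
        #       ClusterIdentifier=..., Database=..., Sql=...)
    return statements
```

Because the event parsing is separated from the query submission, the trigger logic can be tested locally with a hand-built stream event before wiring up the actual DynamoDB Streams event source mapping.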

AWS
Answered 7 years ago
