EMR customer question


A customer is using PySpark on EMR to do some calculations.

The results are saved to S3, which triggers an SQS message, which in turn triggers a COPY command into Redshift.

So far - all good.
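The existing S3 → SQS → COPY step could look something like the sketch below: an SQS-triggered Lambda that unwraps the S3 event notification and issues a COPY per new object. The table name, IAM role ARN, and Parquet format are assumptions; the actual statement execution (Redshift Data API or a direct connection) is left as a comment.

```python
import json

def build_copy_command(bucket, key, table="calc_results",
                       iam_role="arn:aws:iam::123456789012:role/RedshiftCopyRole"):
    """Build a Redshift COPY statement for a newly written S3 object.

    The table name and IAM role here are placeholders; adjust to your cluster.
    """
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )

def handler(event, context):
    """SQS-triggered Lambda: each SQS record wraps an S3 event notification."""
    for record in event["Records"]:
        s3_event = json.loads(record["body"])
        for s3_record in s3_event["Records"]:
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            sql = build_copy_command(bucket, key)
            # Execute sql against Redshift here, e.g. via the
            # Redshift Data API or a psycopg2 connection (omitted).
            print(sql)
```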

They are looking for a solution where, after Redshift is loaded with the new data, they can run some queries over the specific items that were ingested.

They thought they could fire an SQS message for each item from EMR itself after it finishes computing (this sounds like a tightly coupled solution, and I'm not sure how robust it is).

One more clarification: after the calculations are done and inserted into Redshift, they need to push the data to DynamoDB or another NoSQL store for quick retrieval.
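For that last fan-out step, the write to DynamoDB might look like this sketch. The table name `calc_results` and key attribute `item_id` are assumptions about the schema; the one real constraint shown is that DynamoDB rejects Python floats, so numbers must be converted to `Decimal`.

```python
from decimal import Decimal

def to_dynamo_item(row):
    """Convert a computed result row (a dict) into a DynamoDB item.

    DynamoDB rejects floats, so numeric float values are converted to
    Decimal. The key attribute 'item_id' is an assumed schema detail.
    """
    return {k: Decimal(str(v)) if isinstance(v, float) else v
            for k, v in row.items()}

def push_items(rows, table_name="calc_results"):
    """Batch-write computed rows to a DynamoDB table (needs boto3 at runtime)."""
    import boto3  # imported lazily so the pure transform above stays testable
    table = boto3.resource("dynamodb").Table(table_name)
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=to_dynamo_item(row))
```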

Does this sound legitimate?

asked 7 years ago · 196 views
1 Answer
Accepted Answer

You might take a look at this older blog post and modify it for your use case. When the COPY command loads the data into Redshift, it writes an entry to DynamoDB marking the file as processed. You could connect a Lambda function to the DynamoDB stream, which would let you trigger the queries you want to run after the data is loaded.

https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
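The Streams-triggered Lambda described above could be sketched roughly like this. It assumes the loader's tracking table uses a `File` key attribute (an assumption about that blog post's schema); the actual post-load queries are left as a comment.

```python
def extract_loaded_files(event):
    """Pull newly marked-as-processed file names out of a DynamoDB Streams event.

    Only INSERT records matter here: they signal that the loader has just
    marked a file as copied into Redshift. The 'File' key attribute is an
    assumption about the tracking table's schema.
    """
    files = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue
        files.append(record["dynamodb"]["Keys"]["File"]["S"])
    return files

def handler(event, context):
    """Streams-triggered Lambda: run follow-up queries per loaded file."""
    for file_name in extract_loaded_files(event):
        # Run the post-load queries here (e.g. via the Redshift Data API),
        # scoped to the rows that came from file_name.
        print(f"data loaded from {file_name}; running follow-up queries")
```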

AWS
answered 7 years ago
