Reading millions of messages from SQS


Hi, I am working on a requirement where a producer application will be sending approximately 3 million JSON messages per day, each of approximately 2 KB. These messages need to be stored on S3 for further processing and then moved to Snowflake for reporting.

Will Lambda with concurrency be enough for processing? Should it be scheduled or triggered for each message? How can I read messages in batches of 10,000 from a standard queue? Is Snowflake the right choice for storage (in normalized form) and analysis?

Any suggestions/recommendations are welcome.

Thanks, Satyen.

3 Answers

To efficiently process 3 million JSON messages per day and load them into Snowflake, consider combining Amazon Kinesis, AWS Glue, and Amazon S3:

Services Workflow:

  1. Amazon Kinesis Data Streams: Utilize Kinesis for high-throughput, real-time data streaming. This service seamlessly scales to accommodate your data volume, ensuring smooth ingestion of JSON messages.
  2. Kinesis Data Firehose: This service facilitates the automatic transfer of your data streams into Amazon S3. It supports batching, compression, and encryption of the data, enhancing both cost efficiency and security.
  3. AWS Glue: Employ AWS Glue for its powerful ETL capabilities. It can process data stored in S3, executing transformations to refine the data, making it ready for analytical use in Snowflake.
  4. Data Loading to Snowflake: Leverage Snowpipe for effective data loading from S3 into Snowflake. Snowpipe's near-real-time loading capability ensures your data is quickly available for analysis and reporting.

This streamlined architecture is designed for high-volume data handling, offering scalability, cost savings, and minimal management overhead, ideal for your data ingestion and analysis needs.
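
If the producers can write to a Firehose delivery stream (instead of, or in addition to, SQS), a minimal sketch of batching JSON messages into it with boto3 could look like the following; the stream name is a hypothetical placeholder:

```python
import json

import boto3

firehose = boto3.client("firehose")

def send_batch(messages, stream_name="json-to-s3-stream"):
    """Send a batch of JSON-serializable messages to a Firehose delivery stream."""
    # put_record_batch accepts at most 500 records and 4 MiB per call
    records = [{"Data": (json.dumps(m) + "\n").encode("utf-8")} for m in messages[:500]]
    response = firehose.put_record_batch(
        DeliveryStreamName=stream_name,  # hypothetical stream name
        Records=records,
    )
    # Failed records are reported in the response rather than raised; retry them once
    if response["FailedPutCount"] > 0:
        failed = [
            rec
            for rec, result in zip(records, response["RequestResponses"])
            if "ErrorCode" in result
        ]
        firehose.put_record_batch(DeliveryStreamName=stream_name, Records=failed)
```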

Update based on the comments below:

  • Use AWS Lambda functions triggered by SQS to process data in batches; an SQS trigger delivers up to 10 messages per invocation by default (more if you configure a batch window). You can store these messages temporarily in a scalable storage service such as Amazon DynamoDB or Amazon ElastiCache (see the sketch after this list).
  • Implement a scheduled Lambda function or use an Amazon CloudWatch Events rule to trigger a Lambda function at regular intervals. This function would aggregate messages from the temporary storage and batch them into larger files for storage in S3, optimizing for both performance and cost.
  • Once data is in S3, use Snowflake’s Snowpipe to load data into Snowflake efficiently.
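
A minimal sketch of the SQS-triggered buffering Lambda mentioned above, assuming a hypothetical DynamoDB table named json_message_buffer with messageId as its partition key:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical buffer table with "messageId" as the partition key
buffer_table = dynamodb.Table("json_message_buffer")

def lambda_handler(event, context):
    """SQS-triggered handler: buffers each message body in DynamoDB for later aggregation."""
    with buffer_table.batch_writer() as batch:
        for record in event["Records"]:
            batch.put_item(
                Item={
                    "messageId": record["messageId"],
                    "body": record["body"],
                    "sentTimestamp": record["attributes"]["SentTimestamp"],
                }
            )
    # A successful return lets Lambda delete the whole batch from the queue
    return {"batchItemFailures": []}
```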
EXPERT
answered 2 months ago
  • Thanks, but in my case SQS is the source of the data, whereas for Kinesis Data Streams SQS would be a consumer. Can Kinesis Data Streams consume data from SQS? I don't think so; please let me know. Also, reporting is done on the data that arrived in the past hour, so there is no real-time reporting requirement.

  • Updated my response


Hi,

You did not mention the shape of your traffic: if it is constant, 3 million messages per day is only about 35 msg/s (3,000,000 / 86,400 ≈ 34.7), which is not that big at all. The default concurrency limit for Lambda is 1,000 (and you can increase it). Also, look at the throughput numbers for SQS queues at https://aws.amazon.com/sqs/features/ : they can go very high.

If you expect peaks, you should update your question with those figures.

So, what I'd do if I were you:

  1. Create a traffic generator matching your pattern to simulate load against the system you are going to build.
  2. Implement the simplest possible system: one Lambda reads a message from SQS and stores it in S3, then a second Lambda triggered by S3 forwards it to Snowflake (if you need real-time transfer); see the sketch after this list.
  3. See where that gets you.
  4. If you don't get the scaling you need, move to more sophisticated tooling such as Kinesis.
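
A minimal sketch of the first Lambda in step 2, assuming a hypothetical bucket named incoming-json-messages and one S3 object per message:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "incoming-json-messages"  # hypothetical bucket name

def lambda_handler(event, context):
    """SQS-triggered handler: stores each message body as its own S3 object."""
    for record in event["Records"]:
        # Keying by message ID makes retries idempotent (the same object is overwritten)
        s3.put_object(
            Bucket=BUCKET,
            Key=f"raw/{record['messageId']}.json",
            Body=record["body"].encode("utf-8"),
            ContentType="application/json",
        )
```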

I personally try to avoid upfront over-engineering: it makes your development (unduly) complicated. So, trust the serverless architecture of Lambda and SQS to scale properly for you. Only if it doesn't should you go for more sophistication.

In that vein of remaining simple, why not Amazon DynamoDB instead of Snowflake? It scales much higher than you need and can be a data source for all AWS analytics services.

Best,

Didier

EXPERT
answered 2 months ago
  • Thanks. Lambda is our first choice, but its maximum run time of 15 minutes made me unsure. I guess concurrency can solve this problem.

    I like your suggestion about avoiding upfront over-engineering; thanks a lot for that.

    We need to store the data in a structured format, hence Snowflake. Beyond that, there will only be a couple of users analyzing the data with Power BI, for which Snowflake is sufficient.

    Having said that, can DynamoDB provide more cost-efficient storage compared to S3?


Processing 3 million records per day with Lambda is an easy task. Lambda can scale up to 1,000 concurrent invocations per queue, so you can process those messages really quickly.

Lambda can use large batches when consuming from SQS. You can have up to 10,000 messages per batch, as long as the total size does not exceed the 6 MB Lambda payload limit. You can also specify a batch window so that you maximize the actual batch size by waiting for more records to arrive in the queue (see the configuration sketch below).
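
As a sketch, the event source mapping could be configured like this with boto3; the queue ARN and function name below are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Queue ARN and function name are placeholders
response = lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:incoming-json-queue",
    FunctionName="process-json-messages",
    BatchSize=10000,                    # up to 10,000 messages per invocation (standard queues)
    MaximumBatchingWindowInSeconds=60,  # wait up to 60 s for the batch to fill
)
print(response["UUID"])
```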

You did not mention how you want the data in S3. Will each message be its own file, or will they be combined into a single file? If the former, you just need to write the events directly from Lambda to S3. If the latter, use Data Firehose (formerly Kinesis Data Firehose) to deliver the messages from Lambda to S3. Once the data is there, you can also use Athena to query it, so depending on your use case you may not need Snowflake at all.

EXPERT
Uri
answered 2 months ago
