Request to validate an approach

Hi,

I am considering Redshift as an option for a new design. We have ~6B billing events a day, where the raw size of each event is ~1KB (say 12 different fields/columns). My idea is to ingest them via Kinesis Firehose into Redshift. Then there would be other derived tables/views doing aggregation and enrichment (joins to other tables). An obvious challenge seems to be that first table, which would contain a lot of data both in terms of volume and number of records. I am looking for validation that a typical cluster (I am thinking of, say, 3 ra3.4xlarge nodes) would handle constant ingestion at this rate plus auto-refreshing MVs and some minor other queries/exports. There would also be some retention on that table, say 2-3 months.

Looking forward to hearing your thoughts. Thanks, PB

Przemek
asked 7 months ago · 187 views
2 Answers
Accepted Answer

Hello.

Your 3-node cluster of ra3.4xlarge instances will run out of storage in about 2 months (assuming linear ingestion and no optimization/compression): each ra3.4xlarge node provides up to 128 TB of Redshift managed storage, so 3 nodes give roughly 384 TB, and at ~6 TB/day that is about 64 days. Redshift does support automatic compression, which can reduce the storage requirement, but the ratio varies with the type of data, so even with compression you might not have enough room for 3 months. You'd need to consider a larger cluster or regularly offload older data to more cost-effective storage such as Amazon S3, queried through Redshift Spectrum.
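
For illustration only, a rough sketch of that offload path (the table, column, bucket, and IAM role names below are all made up):

```sql
-- Made-up names throughout (billing_events_raw, event_ts, bucket, IAM role).
-- Archive rows older than 3 months to S3 as Parquet...
UNLOAD ('SELECT * FROM billing_events_raw WHERE event_ts < DATEADD(month, -3, GETDATE())')
TO 's3://example-billing-archive/billing_events/'
IAM_ROLE 'arn:aws:iam::111122223333:role/ExampleRedshiftRole'
FORMAT AS PARQUET;

-- ...and expose the archive through Spectrum so it can still be queried occasionally.
CREATE EXTERNAL SCHEMA IF NOT EXISTS billing_archive
FROM DATA CATALOG DATABASE 'billing_archive_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/ExampleRedshiftRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE billing_archive.billing_events_old (
    event_id   VARCHAR(64),
    account_id VARCHAR(32),
    event_ts   TIMESTAMP,
    amount     DECIMAL(18, 4)
)
STORED AS PARQUET
LOCATION 's3://example-billing-archive/billing_events/';
```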

Redshift fed by Kinesis Firehose is quite powerful and can handle large-scale ingestion. However, the rate at which you're ingesting (~6 TB/day) is significant. While Firehose is designed for large-scale streaming data, you will need to optimize: batch the records appropriately (Firehose buffers them and loads Redshift via COPY), tune those loads, and make sure the distribution style and sort keys of the landing table suit your ingestion pattern.
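
As a sketch of what that landing table might look like (column names and types are assumptions, not your actual schema):

```sql
-- Hypothetical landing table for the raw billing events (~12 columns in practice).
-- DISTSTYLE EVEN spreads the constant ingest evenly across slices; a sort key on the
-- event timestamp keeps recent data together, which helps MV refreshes and the
-- retention deletes later on. Column compression is left to ENCODE AUTO, the default
-- for new tables, which typically shrinks the raw size considerably.
CREATE TABLE billing_events_raw (
    event_id   VARCHAR(64),
    account_id VARCHAR(32),
    event_ts   TIMESTAMP,
    event_type VARCHAR(32),
    amount     DECIMAL(18, 4),
    currency   CHAR(3),
    payload    VARCHAR(1024)
)
DISTSTYLE EVEN
SORTKEY (event_ts);
```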

Having Materialized Views (MVs) that auto-refresh, along with the high ingestion rate, can put a strain on the cluster. The performance of the cluster will depend on the complexity of the MVs, the refresh frequency, and other concurrent operations (like other queries and exports). Ensuring the queries are optimized, using distribution and sort keys effectively, and monitoring query performance will be crucial.
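
For example, a simple auto-refreshing aggregation over that landing table might look like this (again, the names and the exact aggregation are made up):

```sql
-- Hypothetical hourly rollup kept fresh by Redshift's auto refresh.
-- Simple aggregates over a single base table can usually be refreshed incrementally,
-- which matters a lot at this ingestion rate; complex MVs may fall back to a full
-- recompute and get expensive.
CREATE MATERIALIZED VIEW billing_hourly_summary
AUTO REFRESH YES
AS
SELECT
    DATE_TRUNC('hour', event_ts) AS event_hour,
    account_id,
    COUNT(*)    AS event_count,
    SUM(amount) AS total_amount
FROM billing_events_raw
GROUP BY 1, 2;
```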

If you plan to retain 2-3 months of data in the cluster, you'll need a strategy for regularly deleting old data, given your high ingestion rate. Redshift has no native table partitioning, but organizing the data by time (a time-based sort key, or separate monthly tables behind a view) makes that deletion much more efficient.
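
A minimal sketch of such a retention job, using the same made-up names as above and run on a schedule:

```sql
-- Drop anything older than 3 months; with event_ts as the sort key the delete
-- only touches old blocks rather than scanning the whole table.
DELETE FROM billing_events_raw
WHERE event_ts < DATEADD(month, -3, GETDATE());

-- Reclaim the deleted space and refresh planner statistics. Redshift's automatic
-- vacuum will get to this eventually, but an explicit pass right after a large
-- delete keeps the table lean.
VACUUM DELETE ONLY billing_events_raw;
ANALYZE billing_events_raw;
```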

Best regards, Andrii

EXPERT
answered 7 months ago
  • Thanks! The number 3 was just made up; I didn't mean 30 for sure. I assumed a compression ratio of, say, 6x. In general, what I care about is scalability. The number of nodes may be tweaked (maybe 6 as well), and so may the retention, but I was more looking to validate the concept. The way I see it, this is a bit like a data processing pipeline inside a database. I would love to hear about similar cases and how they ended up.

  • Is it really 6 TB/day? From the question it looks like 6 billion records/day.


It is 6B (10^9) events of ~1KB (10^3 bytes) each, so indeed ~6 TB (6×10^12 bytes) of uncompressed data per day.

Przemek
answered 7 months ago
