Hello.
Your 3-node cluster of ra3.4xlarge instances will run out of storage in about 2 months (assuming linear ingestion and no optimizations/compressions). Redshift does support automatic compression which can reduce the storage requirements, but this varies based on the type of data. So, even with compression, you might not have enough storage for 3 months of data. You'd need to consider a larger cluster size or regularly offload older data to more cost-effective storage options like Amazon S3 using Redshift Spectrum.
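The runway estimate above can be sanity-checked with back-of-envelope arithmetic. This sketch assumes each ra3.4xlarge node can address up to 128 TB of Redshift Managed Storage (verify against current AWS quotas for your region):

```python
# Back-of-envelope storage runway for the cluster described above.
# Assumption: 128 TB of Redshift Managed Storage per ra3.4xlarge node.
NODES = 3
TB_PER_NODE = 128          # assumed managed-storage quota per node
INGEST_TB_PER_DAY = 6      # 6B events/day * 1 KB each, uncompressed

capacity_tb = NODES * TB_PER_NODE                  # 384 TB total
days_until_full = capacity_tb / INGEST_TB_PER_DAY  # ~64 days, i.e. ~2 months

print(f"capacity: {capacity_tb} TB, runway: {days_until_full:.0f} days")
```

With a hypothetical 6x compression ratio the runway stretches several-fold, but actual compression depends heavily on the shape of the data.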
Redshift combined with Kinesis Data Firehose is quite powerful and can handle large-scale ingestion. However, the rate at which you're ingesting (6 TB/day) is significant. While Firehose is designed for large-scale streaming data, you'll need to optimize carefully: batch records appropriately, tune the COPY commands, and make sure the table's distribution style suits your ingestion pattern.
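To see why batching matters, it helps to translate the daily volume into a per-second rate. This sketch uses the numbers from the question; the 128 MB Firehose buffer is an assumption (check the current Kinesis Data Firehose buffering limits for the Redshift destination):

```python
# Rough sizing of the ingestion stream, to reason about Firehose
# buffering and COPY batch sizes. Event counts are from the question;
# the buffer size is an assumed Firehose maximum.
EVENTS_PER_DAY = 6_000_000_000
EVENT_BYTES = 1_000
BUFFER_MB = 128  # assumed max Firehose buffer size

events_per_sec = EVENTS_PER_DAY / 86_400          # ~69,444 events/s
mb_per_sec = events_per_sec * EVENT_BYTES / 1e6   # ~69 MB/s sustained
seconds_per_buffer = BUFFER_MB / mb_per_sec       # buffer fills every ~2 s

print(f"{events_per_sec:,.0f} events/s, {mb_per_sec:.0f} MB/s, "
      f"buffer fills every {seconds_per_buffer:.1f} s")
```

At ~69 MB/s sustained, even a maximal buffer fills every couple of seconds, which means tens of thousands of small COPY loads per day unless the S3 objects are consolidated into larger batches first.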
Having Materialized Views (MV) that auto-refresh, along with the high ingestion rate, can put a strain on the cluster. The performance of the cluster will depend on the complexity of the MVs, frequency of refresh, and other concurrent operations (like other queries and exports). Ensuring the queries are optimized, using distribution and sort keys effectively, and monitoring the query performance will be crucial.
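For reference, an auto-refreshing MV in Redshift is declared with the `AUTO REFRESH YES` clause. Here is a minimal sketch of the DDL (carried as a Python string); the table and column names (`events`, `event_type`, `event_time`) are hypothetical:

```python
# Sketch of an auto-refreshing materialized view in Redshift SQL.
# All table and column names below are hypothetical.
MV_DDL = """
CREATE MATERIALIZED VIEW mv_events_hourly
AUTO REFRESH YES
AS
SELECT event_type,
       DATE_TRUNC('hour', event_time) AS event_hour,
       COUNT(*) AS event_count
FROM events
GROUP BY 1, 2;
"""
print(MV_DDL.strip())
```

Refresh is incremental only when Redshift can maintain the MV incrementally (simple aggregates like the one above usually qualify); MVs that fall back to full recompute are where a 6 TB/day ingest rate starts to hurt.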
If you plan to retain 2-3 months of data in the cluster, you'll need a strategy for regularly deleting old data, given your high ingestion rate. A time-based table layout makes those deletions much cheaper than large DELETEs.
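Since Redshift has no native table partitioning, one common pattern is a table per month behind a `UNION ALL` view: expiring a month is then a cheap `DROP TABLE` instead of a large DELETE followed by VACUUM. A minimal sketch, with all table and view names hypothetical:

```python
# Sketch of a month-sharded retention scheme for Redshift.
# One table per month behind a UNION ALL view; expiring a month is a
# metadata-only DROP TABLE. All names here are hypothetical.
def retention_statements(months, keep=3):
    """months: list of 'YYYY_MM' suffixes, oldest first."""
    live, expired = months[-keep:], months[:-keep]
    drops = [f"DROP TABLE IF EXISTS events_{m};" for m in expired]
    view = ("CREATE OR REPLACE VIEW events AS\n"
            + "\nUNION ALL\n".join(f"SELECT * FROM events_{m}" for m in live)
            + ";")
    return drops, view

drops, view = retention_statements(["2023_06", "2023_07", "2023_08", "2023_09"])
print("\n".join(drops))
print(view)
```

The trade-off is that the view must be recreated whenever a month is added or dropped, and queries spanning months read every shard, so sort keys on the timestamp column still matter.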
Best regards, Andrii
It is 6B (10^9) events of 1 KB (10^3 bytes) each, so indeed 6 TB (10^12 bytes) of uncompressed data per day.
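The comment's arithmetic in one line (decimal units throughout):

```python
# 6 billion events/day at ~1 KB each, in decimal (SI) units.
events, bytes_each = 6 * 10**9, 10**3
tb_per_day = events * bytes_each / 10**12
print(tb_per_day)  # 6.0
```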
Thanks! The number 3 was just made up; I didn't mean 30, for sure. I assumed a compression ratio of, say, 6x. In general, what I care about is scalability. The number of nodes may be tweaked (maybe 6), and so may the retention, but I was mostly looking to validate the concept. The way I see it, this is a bit like a data processing pipeline inside a database. I would love to hear about similar cases and how they turned out.
Is it really 6 TB/day? From the question it looks like 6 billion records/day.