Data Infrastructure on AWS without Redshift for a small business

I work for a small fintech startup where I'm the only person responsible for building the data applications that inform the business. I'm looking for the best data infrastructure for a small business using a modern data stack.

So far, I've answered every business question by writing SQL views in MySQL, connecting Power BI to those views, shaping the data with DAX, and presenting the visualizations in Power BI. This setup has worked well, but I'm thinking about the next step up, especially since we expect the system to grow.

Our primary data sources are:

  • RDS (MySQL): Tables for transactions, users, and logs. The largest tables have about a million rows, growing by 10,000 to 30,000 rows per month. Most are under 1 GB, though one heavy table with several JSON columns weighs 14 GB.
  • DynamoDB: A table of financial decisions with 2 million rows, taking up 3 GB.
  • Google Sheets: Tables with fewer than a thousand rows.
  • Bank API for fetching rates: Tables also with fewer than a thousand rows. I currently copy this data into Google Sheets and connect from there.

The solution I'm considering is landing these data sources in an S3 bucket (acting as my Data Lake), and from there transforming them into another S3 bucket registered in the data catalog, which would effectively be my Data Warehouse. I'd then connect Power BI to the transformed data in S3 through Amazon Athena. Does this approach sound sensible, or would you suggest a different route? Would you store the data in S3, or opt for RDS or another solution?

I'm leaning towards S3 + data catalog queried with Athena rather than Redshift for the Data Warehouse because, given our company's size, Redshift seems overpriced. We are a small team of 30, and our total AWS bill is about $6,000 a month. I turned Redshift on for three days and it cost $600, which seems unjustifiable at our scale.
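For context, here is a minimal sketch of what the lake-to-warehouse step could look like in Athena, assuming a raw CSV export of the transactions table landing in a hypothetical `s3://my-data-lake/raw/transactions/` prefix and a Glue database named `analytics` (all bucket, database, table, and column names are placeholders):

```sql
-- Raw layer: expose the CSV export that lands in the lake bucket.
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.raw_transactions (
    transaction_id  bigint,
    user_id         bigint,
    amount          decimal(18,2),
    created_at      timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-data-lake/raw/transactions/'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- Warehouse layer: rewrite the raw data as partitioned Parquet in the
-- second bucket so Athena scans (and bills) far less data per query.
CREATE TABLE analytics.transactions
WITH (
    format = 'PARQUET',
    external_location = 's3://my-data-warehouse/transactions/',
    partitioned_by = ARRAY['year_month']
) AS
SELECT
    transaction_id,
    user_id,
    amount,
    created_at,
    date_format(created_at, '%Y-%m') AS year_month
FROM analytics.raw_transactions;
```

Power BI would then point at `analytics.transactions` through the Athena connector or ODBC driver, and the CTAS/refresh step could be re-run by whatever scheduler is chosen (Lambda, Glue, Airflow, etc.).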

Would you carry out the ETL with Lambda, Glue, or another tool? Would you consider implementing Airflow, DBT, Git, CI/CD, or other tools at any stage of the process?

I'm grappling with these considerations and would truly appreciate any insights, opinions, or resource recommendations before undertaking these significant shifts in our organization.

Thank you in advance for your advice and expertise!

1 Answer

Your approach is not uncommon; I have seen many organizations start with Athena and find it meets their query response times. Having an S3 data lake future-proofs your architecture and gives you the flexibility to switch compute engines later. Redshift can be brought in once you find that Athena no longer delivers the performance you need.

With Redshift Serverless you get more powerful compute for the queries that need better performance than Athena can offer. You pay only for what you use: if your queries run for only three hours a day, that is your entire compute cost, since there are no charges for idle time. Redshift Serverless is also fully integrated with the S3 data lake, so you can query data in place without copying it into local Redshift tables. That said, I have also seen customers create aggregated, pre-joined data sets as local Redshift tables to meet tighter query SLAs.
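To make the "query in place vs. local tables" distinction concrete, here is a rough sketch against the same hypothetical Glue database `analytics` used above, with a placeholder IAM role ARN; it is illustrative rather than a drop-in script:

```sql
-- Query the S3 data lake in place: expose the Glue Data Catalog database
-- to Redshift Serverless as an external schema (Redshift Spectrum).
CREATE EXTERNAL SCHEMA lake
FROM DATA CATALOG
DATABASE 'analytics'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-spectrum-role';

-- Queries hit the Parquet files in S3 directly; nothing is copied.
SELECT user_id, SUM(amount) AS total_amount
FROM lake.transactions
WHERE year_month = '2024-01'
GROUP BY user_id;

-- Optional: materialize an aggregated, pre-joined data set as a local
-- Redshift table when a dashboard needs tighter query SLAs.
CREATE TABLE monthly_user_totals AS
SELECT year_month, user_id, SUM(amount) AS total_amount
FROM lake.transactions
GROUP BY year_month, user_id;
```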

AWS
answered 6 months ago
