Skip to content

Data Infrastructure on AWS without Redshift for a small business

0

I work for a fintech startup which is a small business where I'm the sole individual responsible for building data applications to inform the business. I'm looking for the best data infrastructure for a small business using a modern data stack.

To date, I've addressed all the business queries by using SQL to craft views in MySQL, connecting Power BI to these views, using DAX to shape the information, and showcasing the visualizations in Power BI. This setup has been effective so far, but I'm pondering about the next professional step, especially as we anticipate our system to grow.

Our primary data sources are:

  • RDS (MySQL): Tables for transactions, users, and logs. The largest tables have about a million rows, growing by 10 to 30 thousand rows monthly. Most are under 1GB, though there's a hefty table with several JSONs weighing 14GB.
  • DynamoDB: A table for financial decisions with 2 million rows, taking up 3GB.
  • Google Sheets: Tables with fewer than a thousand rows.
  • Bank API for fetching rates: Tables also with fewer than a thousand rows. I currently transfer this to Google Sheets and then connect.

The solution I'm considering is moving these data sources to an S3 bucket (acting as my Data Lake), and from there, to another S3 bucket added to the data catalog, which would essentially be my Data Warehouse. Then, I'd connect Power BI with Amazon Athena to these transformed databases in S3. Does this approach sound sensible to you, or would you suggest a different route? Would you store the data in S3 or opt for RDS or another solution? I'm leaning towards S3 + data catalog connected with Athena instead of Redshift for the Data Warehouse because, given our company's size, Redshift seems overpriced. We are a small team of 30, and our total AWS cost is about $6,000 a month. I turned on Redshift for three days, and it cost $600; it seems unjustifiable for our scale.

Would you carry out the ETL with Lambda, Glue, or another tool? Would you consider implementing Airflow, DBT, Git, CI/CD, or other tools at any stage of the process?

I'm grappling with these considerations and would truly appreciate any insights, opinions, or resource recommendations before undertaking these significant shifts in our organization.

Thank you in advance for your advice and expertise!

2 Answers
2

Your approach is not uncommon, and I have seen many organization first leveraging Athena to meet their query response times. Having an S3 data lake future proofs your architecture and allows you the flexibility to switch compute in the future. Redshift can be leveraged once you encounter that Athena is unable to provide you the performance you need. With Redshift Serverless you can get more powerful compute for those queries that need better performance than what Athena can offer. You pay for use and if queries execute only 3 hours a day then that is your compute cost with Redshift Serverless as there are no charges for idle times. Also, Redshift Serverless is fully integrated with S3 data lake and you can query data in-place without needing to copy data as local Redshift tables. However, I have also seen customers will create aggregated, and pre-joined data sets as Redshift local tables to meet tighter query SLA's.

AWS
EXPERT
answered 2 years ago
1

Hello

There are cases where S3 + Athena won't be able to reach the same performance figures as Redshift or RDS, especially for very complex queries. However if your data and the queries you run perform well on Athena, there is certainly possibility to have significant savings. Better is to run a small PoC to see which one of the possible solutions corresponds better to your case.

As for ETL, if your transformations are simple and short enough, then Lambda will probably fit there very well, and with Step Functions orchestration the jobs/workflows will also be easy to maintain. Glue is more powerful if you need that and won't have limitations of the Lambda.

AWS
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.