What are the options for moving daily transactional data from AWS RDS PostgreSQL to S3?


Hi,

My product is multi-tenant, and each tenant has an isolated AWS RDS PostgreSQL DB. I have over 2,000 tenants, i.e. 2,000 RDS DBs, and this number will keep growing over time.

My product is an integration tool that fetches a lot of data from different components, so it has to persist every request/response payload. For high-volume tenants the DB grows by 4 to 8 GB per day, and most of the time tenants only work with the most recent data (roughly the last 4 to 7 days). We therefore decided to move any data older than 7 days to S3; if a user requests data older than 7 days, we plan to retrieve it from S3 (i.e. using Glue and Athena on top of S3).
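For reference, the retrieval side we have in mind is a simple Athena query over the archived objects via a Glue table; the database, table, column, and partition names below are only placeholders:

```
-- Hypothetical Glue database/table over the S3 archive, partitioned by tenant and date.
SELECT request_id, response_payload, created_at
FROM   archive_db.request_payloads
WHERE  tenant_id = 'tenant-123'
  AND  dt BETWEEN '2024-01-01' AND '2024-01-07'   -- partition pruning keeps the scan (and cost) small
ORDER  BY created_at;
```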

Kindly suggest the most robust and cost-optimal approaches for moving the data from AWS RDS PostgreSQL to S3.

NOTE: If a tenant exceeds its data limit early (e.g. 20 GB before the 7 days are up), we must move the data before the 7-day mark as well, i.e. trigger the move on demand/dynamically.

  1. The data must not be lost; at any point in time it should be queryable either from RDS or from Athena.
  2. A live eventing/streaming or CDC approach is also possible to implement, but we are looking for something robust and cost-optimal, since the number of tenants grows very fast.
  3. Currently we run a Lambda function twice a day to copy the data from RDS to S3 and then delete it. This is not an optimal approach, and moreover it is time-consuming.
  4. We also considered AWS DMS CDC as a kind of live streaming to S3, with a Glue Crawler continuously indexing the copied data, but is DMS CDC cost-optimal for such a large number of RDS instances, especially with roughly 20% of tenants being high-volume?
2 Answers
  • Hi Mi Sha, I was not looking for a backup solution. The data size grows very quickly, so we want to move data older than 7 days to S3, because in 80% of use cases clients only work with the last 7 days of data. If a tenant asks for data older than 7 days, we retrieve it from S3 with Glue and Athena.


Hi,

Please have a look at the aws_s3 extension for PostgreSQL: it was designed for use cases very close to yours.

See https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/postgresql-s3-export.html

It is highly cost-efficient: it runs inside your RDS instance at no additional compute cost. You will only pay for the S3 storage itself (choose the right storage class for your use case to remain cost-optimal).

If you call this extension from a stored procedure, you don't need any external compute service (no Lambda to run).
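As a minimal sketch of what that could look like (the table, column, bucket, and region names below are assumptions to adapt to your schema), assuming the instance has an IAM role attached that allows s3:PutObject on the target bucket:

```
-- One-time setup: install the extension (also installs aws_commons).
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

-- Hypothetical table (request_payloads), column (created_at), bucket, and region.
CREATE OR REPLACE PROCEDURE archive_old_payloads()
LANGUAGE plpgsql
AS $$
BEGIN
  -- Export rows older than 7 days to S3 as CSV.
  PERFORM aws_s3.query_export_to_s3(
    'SELECT * FROM request_payloads WHERE created_at < now() - interval ''7 days''',
    aws_commons.create_s3_uri(
      'my-archive-bucket',                                              -- bucket (assumption)
      'tenant-123/payloads/' || to_char(now(), 'YYYY-MM-DD') || '.csv', -- key prefix (assumption)
      'us-east-1'),                                                     -- region (assumption)
    options := 'format csv');

  -- Delete only after the export call has returned successfully.
  DELETE FROM request_payloads
   WHERE created_at < now() - interval '7 days';
END;
$$;
```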

Finally, you can schedule it directly within PostgreSQL using pg_cron, by following this blog post: https://aws.amazon.com/blogs/database/schedule-jobs-with-pg_cron-on-your-amazon-rds-for-postgresql-or-amazon-aurora-for-postgresql-databases/
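As a rough sketch (the job name and schedule below are assumptions), the scheduling part could look like this:

```
-- pg_cron must be enabled in the RDS parameter group
-- (shared_preload_libraries = 'pg_cron') before the extension can be created.
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Run the archival procedure every night at 03:30 UTC.
SELECT cron.schedule('archive-old-payloads', '30 3 * * *',
                     $$CALL archive_old_payloads()$$);
```

For the on-demand case you mention (a tenant hitting its size limit before the 7 days are up), the same procedure can simply be invoked directly with CALL archive_old_payloads() whenever your application detects the threshold.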

So, by doing this, you'll have a very "compact" (all in pg) and cost-efficient solution, which only adds the cost of S3 to your current costs.

Best,

Didier

  • Hi Didier,

    Thank you very much for your quick and cost-optimal solution.

    I hope this also works for very large transfer volumes: say 30 GB per tenant every 7 days across 2,000 tenants, which in the worst case means roughly 60 TB transferred every 7 days.

    We also had a latency issue once data was moved to S3: the Glue Crawler takes time to index the S3 data. That is why we would like to use CDC to feed S3 live, with the Glue Crawler indexing it in parallel; we can then directly delete data older than 7 days from PostgreSQL.

    Thanks, Baji
