By using AWS re:Post, you agree to the Terms of Use
/Advice for best database/datastorage for historical data/

Advice for best database/datastorage for historical data



I´m doing some reasearch to find the best place to centralize lots of data logs generated by my application considering pricing ,performance and scalabilty. Today all my application data including logs are stored on an Oracle database, but I´m thinking to move all the LOG related data outside it to reduce it´s size and not to worry about storage performance etc...

Just put everything on a "infinite" storage apart from my actual database using CDC or a regular batch process

Below are some needs:

  • Only inserts are necessary (no updates or deletes)
  • Customers will need access to this historic data
  • Well defined pattern of access (one or two indexes at maximum)
  • Latencies of few seconds is ok
  • Avoid infrastrucure, DBA, perfomance bottleneck log term...
  • Infinite Retentiton period (means I don´t want to worry about performance issues, storage size in long term. But something that can handle a few terabytes of data )

Use case example: Historical Sales order by items ( id_item | id_customer | qty_sold | date_inserted ... ), aprox 50 millions records per day Where I would need to see the historical data by item, and by customer for example (two dimensions)

I´ve done some research with the options below

S3 + Athena -> Put everthing on s3, no worries about infrastructure perfomance issues, however as I need query by item and customer, probably it´would be necessary to break files by item or customer , generate millions of partitions to avoid high costs searching on every file etc..

Postgre -> Not sure if could be performance bottleneck once tables gets too big even with partition strategies

DynamoDB -> Not sure if it´s a good alternative to historical data regarding pricing once seconds latency is ok

MongoDB/ DocumentDB -> Not very familiar with it (I´d prefer SQL language type) but I know it´s has a good scalability

Cassandra-> dont´know very much

Timeseries db as influxDB, timestream etc..-> dont´know very much, but it seems appropriate for timeseries

What option would you choose ?

Sorry in advance if I saying something wrong or impossible :)

Thank you!

1 Answers

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions