Faster processing: EBS vs S3


What is the best way for an application hosted on an EC2 instance to read data from S3 instead of EBS? I currently use an EC2 instance to read data stored on EBS (approx. 2 TB) and run many transformations through ETL and analytics jobs. As part of a strict 3-tier architecture, however, this data needs to move from EBS (application layer) to the data tier (preferably S3). My understanding is that if I move all of this data permanently from EBS to S3 and read 2 TB from S3 daily for my jobs, job performance will be very poor.

  1. Can you please suggest a better approach?
  2. Instead of S3, can I use any other service?
  3. The system is a Linux system, and hence I can't use FSx.
  4. I need lightning-fast performance for my jobs. Any help in this regard will be appreciated.
1 Answer
Accepted Answer

AWS has multiple options for this kind of workload. Prescribing a solution is hard without all the details about producers, consumers, and other requirements, but I will try to shed some light on a few options.

S3 is well suited to be a data lake. You keep raw data there so it can be processed elsewhere. Usually, ETL jobs spin up, download data from S3, process it, and save the result in another datastore.

This second datastore is the data warehouse (DW), which holds data that has been processed and has some business value. It should be easier to run analytics jobs from there, because DW solutions (like Redshift) are usually optimized for that kind of thing.

As for speed, it depends on a bunch of factors.

  • Is your data spread across multiple files, so that you could process them in parallel?
  • Can you optimize the code?
  • Are you hitting CPU/memory/IO limits?
  • Is the download time (from S3) acceptable?
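To put a rough number on the download-time question, here is a minimal back-of-the-envelope sketch. It only does arithmetic: the throughput figures are hypothetical placeholders, not measured or guaranteed S3 or EC2 numbers, and the function name `estimate_transfer_seconds` is made up for this illustration. Replace the values with throughput you actually observe from your instance.

```python
def estimate_transfer_seconds(size_bytes: int, throughput_bits_per_sec: int) -> float:
    """Return the ideal time in seconds to move size_bytes at a sustained throughput."""
    if throughput_bits_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return (size_bytes * 8) / throughput_bits_per_sec


DATASET_BYTES = 2 * 10**12  # about 2 TB, as in the question (decimal terabytes)

# Hypothetical sustained throughputs in Gbit/s; measure your own before relying on these.
ASSUMED_GBITS = [1, 10, 25]

for gbit in ASSUMED_GBITS:
    seconds = estimate_transfer_seconds(DATASET_BYTES, gbit * 10**9)
    print(f"{gbit:>2} Gbit/s -> {seconds / 60:.1f} minutes")
```

This is a best case that ignores request overhead, retries, and any processing time. If the data is split across many objects, parallel downloads are what let you approach the sustained figure you plug in.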

Sorry for not having a more prescriptive answer, but I hope that helps you a little bit.

answered 17 days ago
  • Thanks for responding. Are you suggesting that the performance will be better if I move this data from EBS to S3?

  • No, that wasn't my intention. To improve performance, you really should identify what the bottleneck is. It could be CPU, memory, IO performance, or even the EBS bandwidth. Performance is also not always tied to infrastructure, so having some visibility into the ETL itself can also give you some clues.

    S3 is suited for data retention, but the ETL will have to download data from there before being able to process it. The data is usually saved to an EBS disk and later loaded into memory for processing, but it depends on the ETL.

    So you can see that both EBS and S3 will be part of the whole process. The difference is that you should avoid using EBS for data retention, but you can also consider using provisioned IOPS for better storage performance on EBS.
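One lightweight way to get the visibility into the ETL mentioned in this reply is to time each stage separately. The sketch below is only an illustration: `read_input`, `transform`, and `write_output` are hypothetical stand-ins for the real job's stages, and the `timed` helper is not part of any AWS tool.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> accumulated wall-clock seconds


@contextmanager
def timed(stage):
    """Accumulate the wall-clock time spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)


# Hypothetical stand-ins for the real ETL stages.
def read_input():
    return list(range(100_000))


def transform(rows):
    return [r * 2 for r in rows]


def write_output(rows):
    return len(rows)


with timed("read"):
    rows = read_input()
with timed("transform"):
    out = transform(rows)
with timed("write"):
    written = write_output(out)

slowest = max(timings, key=timings.get)
for stage, seconds in timings.items():
    print(f"{stage:<10} {seconds:.4f} s")
print(f"slowest stage: {slowest}")
```

If the read stage dominates, storage or network throughput is the likely suspect. If the transform stage dominates, faster storage will not help much, and CPU, memory, or the code itself deserves a closer look.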

  • Actually, I am referring to file system here.
