Architecture choice for efficiently solving a compute problem


I have an array of around 75,000 rows. For each row, a calculation needs to be done (a Python script) which takes about an hour and then produces about 1 GB of data. The calculation must be run 10 times against each row using different parameters, so in total it will be run 750,000 times.

I'm in no rush for the results, and it does not matter how long it takes to complete and get back all the data.

My question is what combination of AWS services can be used to efficiently solve this problem?

alexis
asked 7 months ago · 172 views
3 Answers

To efficiently solve a compute-intensive problem like this, where you need to perform calculations on a large dataset and generate substantial amounts of data, you can leverage several AWS services and architectural patterns. Here's a recommended approach:

  1. Amazon EC2 Instances:

    • Use EC2 instances for running your Python scripts. Choose instance types that are well-suited for CPU-intensive tasks (e.g., compute-optimized instances). The instance type you select should depend on the resource requirements of your calculations.
    • You can create an EC2 Auto Scaling group to dynamically adjust the number of instances based on the workload.
  2. Amazon S3:

    • Store your input data (the array of 75,000 rows) and the output data (1GB results per calculation) in Amazon S3 buckets. This provides durability, scalability, and easy data management.
    • Divide your input data into manageable chunks and organize them in S3 for parallel processing.
  3. AWS Batch:

    • AWS Batch can help you manage and schedule the execution of your Python scripts in a distributed and efficient manner.
    • Define batch jobs for your calculations, specifying the input data location, output data location, and parameters.
    • AWS Batch can automatically scale the number of EC2 instances based on your job queue's size and priority (a job-submission sketch follows this list).
  4. Amazon SQS (Optional):

    • If you want to decouple the submission of calculation jobs from their execution, you can use Amazon SQS to queue up jobs. Your main application can enqueue the jobs, and worker instances (EC2 instances) can poll the queue to retrieve and execute jobs. A queue sketch follows at the end of this answer.
  5. Amazon CloudWatch:

    • Use CloudWatch to monitor the performance and resource utilization of your EC2 instances and AWS Batch jobs. This can help you fine-tune your resources and optimize cost.
  6. AWS Lambda (Optional):

    • If you have any post-processing or data transformation tasks after the calculations are complete, you can trigger AWS Lambda functions based on events in your S3 buckets.
  7. Cost Optimization:

    • To optimize costs, consider using Spot Instances for your EC2 instances if you can tolerate interruptions. Spot Instances are often significantly cheaper than On-Demand instances.
  8. Data Organization:

    • Organize your data in S3 using a structure that makes it easy to track which calculations have been completed and which are pending. You may want to use prefixes or folders for this purpose.
  9. Logging and Error Handling:

    • Implement proper logging and error handling mechanisms in your Python scripts to ensure you can diagnose and troubleshoot issues efficiently.
  10. Security:

    • Ensure that your EC2 instances and S3 buckets are configured with the appropriate security settings, including IAM roles and policies to grant necessary permissions.
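
As a concrete illustration of the AWS Batch step, here is a minimal sketch of how the 750,000 calculations could be submitted as array jobs with boto3. The queue name, job definition name, and the 10,000-task chunking limit are assumptions you would adapt to your own setup.

```python
# Sketch only: submit the 750,000 calculations as AWS Batch array jobs.
# "row-calc-queue", "row-calc-jobdef" and ARRAY_MAX are assumed names/limits.
import boto3

batch = boto3.client("batch")

TOTAL_ROWS = 75_000
PARAMETER_SETS = 10
ARRAY_MAX = 10_000  # assumed per-array-job size cap; chunk rows accordingly

for param_set in range(PARAMETER_SETS):
    for offset in range(0, TOTAL_ROWS, ARRAY_MAX):
        size = min(ARRAY_MAX, TOTAL_ROWS - offset)
        batch.submit_job(
            jobName=f"rowcalc-p{param_set}-o{offset}",
            jobQueue="row-calc-queue",        # assumed job queue
            jobDefinition="row-calc-jobdef",  # assumed job definition
            arrayProperties={"size": size},
            containerOverrides={
                "environment": [
                    {"name": "PARAM_SET", "value": str(param_set)},
                    {"name": "ROW_OFFSET", "value": str(offset)},
                ]
            },
        )
```

Inside the container, the script can work out its row as ROW_OFFSET + AWS_BATCH_JOB_ARRAY_INDEX (an environment variable Batch sets for array-job children) and write its ~1 GB result under a keyed prefix such as s3://your-bucket/results/param_set=<n>/row=<m>/, which also gives you the completed-versus-pending tracking mentioned under data organization.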

By architecting your solution with these AWS services and best practices, you can efficiently distribute and manage the computation of your calculations across a large dataset while keeping costs under control. Additionally, you have the flexibility to scale resources up or down as needed to meet your processing requirements.
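
If you choose the optional SQS route instead of (or alongside) Batch scheduling, the producer/worker split might look like the sketch below. The queue name and message format are assumptions, and the queue's visibility timeout would need to be long enough (or periodically extended) to cover the roughly one-hour calculation.

```python
# Sketch only, assuming a queue named "row-calc-jobs" already exists.
# The producer enqueues one message per (row, parameter set); each worker
# polls the queue, runs the calculation, and deletes the message on success.
import json
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="row-calc-jobs")["QueueUrl"]

# Producer side: enqueue a work item.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({"row": 123, "param_set": 4}),
)

# Worker side: long-poll for a job, process it, then delete it.
# Note: the visibility timeout must exceed the processing time, or the
# worker should extend it with change_message_visibility while it runs.
messages = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
).get("Messages", [])

for msg in messages:
    job = json.loads(msg["Body"])
    # ... run the ~1 hour calculation for job["row"] / job["param_set"] ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```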

answered 7 months ago

Hi,

The simplest approach is to use a serverless service that manages the scale-up and scale-down with no effort on your part.

Given that each computation lasts about 1 hour, Lambda is excluded (hard limit of 15 minutes). So Fargate (the serverless version of ECS) will be best. You just have to wrap your Python script in a Docker image that Fargate will run.

See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-quotas.html: you can have up to 5,000 container instances per Fargate cluster.

You can overlay AWS Batch on top to ease the scheduling of your computations: see https://aws.amazon.com/batch/
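
To make "wrap your Python script in a Docker image" concrete, here is a rough sketch of the worker entrypoint such a container might run under AWS Batch on Fargate. The bucket name, key layout, and run_calculation placeholder are assumptions standing in for your real script.

```python
# Sketch of a containerized worker for AWS Batch on Fargate.
# run_calculation, the bucket name, and the key layout are placeholders.
import os
import boto3


def run_calculation(row: int, param_set: int) -> str:
    """Placeholder for the ~1 hour computation that produces ~1 GB of output."""
    out_path = f"/tmp/result_{row}_{param_set}.bin"
    with open(out_path, "wb") as f:
        f.write(b"replace this with your real calculation output")
    return out_path


def main() -> None:
    # For array jobs, AWS Batch injects AWS_BATCH_JOB_ARRAY_INDEX.
    index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    row = int(os.environ.get("ROW_OFFSET", "0")) + index
    param_set = int(os.environ.get("PARAM_SET", "0"))

    result_file = run_calculation(row, param_set)

    s3 = boto3.client("s3")
    key = f"results/param_set={param_set}/row={row}/output.bin"
    s3.upload_file(result_file, "my-results-bucket", key)  # assumed bucket


if __name__ == "__main__":
    main()
```

The image itself can be built from a standard Python base image with boto3 and your dependencies installed, pushed to Amazon ECR, and referenced from the Batch job definition (or ECS task definition).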

Best,

Didier

AWS EXPERT
answered 7 months ago

You can use AWS Glue (write your logic in Python) and store the data in S3; both services are serverless.
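
If you do explore the Glue route, a rough sketch of kicking off runs of an existing Glue Python shell job from a driver script could look like the following; the job name, arguments, and bucket are hypothetical, and the job script itself is assumed to read these arguments and write its output to S3.

```python
# Sketch only: start one Glue job run for a chunk of rows and one parameter set.
# "row-calculation" and the argument names are assumed, not real resources.
import boto3

glue = boto3.client("glue")

response = glue.start_job_run(
    JobName="row-calculation",  # assumed Glue job name
    Arguments={
        "--row_start": "0",
        "--row_end": "999",
        "--param_set": "3",
        "--output_bucket": "my-results-bucket",  # assumed bucket
    },
)
print(response["JobRunId"])
```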

answered 7 months ago
