SageMaker batch jobs


I have an ML workload that involves providing predictions for large datasets on demand. The model is a PyTorch text classifier and the workload involves pricing predictions for tens of thousands of records. The jobs arrive randomly but are very infrequent.

I've created a standard SageMaker endpoint and performance is acceptable when I do client batching, i.e.

outputs = []
for batch in create_minibatch(inputs, batch_size=128):
    predictions = predictor.predict(batch)
    outputs.extend(predictions)

This takes around a minute for 25k records using a single instance.

I've considered using SageMaker Batch Transform, but it takes around 4-6 minutes to create each job, so any benefit from being able to scale out to multiple instances seems to be lost to the start-up cost. Is it possible to use SageMaker batch processing with a persistent endpoint?

The alternative is to use client batching (using the code above) - but if I create multiple instances can I be sure that each batch is returned in the order I request? In the example above I need to zip the inputs and outputs.

Is there a better way of serving this workload? I feel it falls somewhere in between the API and batch paradigms.

Dave
asked a year ago · 408 views
2 Answers
Accepted Answer

Hi,

Please find my answers below.

Question: Is it possible to use SageMaker batch processing with a persistent endpoint?

Answer: Currently, SageMaker does not have such an option.

Question: The alternative is to use client batching (using the code above) - but if I create multiple instances can I be sure that each batch is returned in the order I request? In the example above I need to zip the inputs and outputs.

Answer: When you send each batch to your endpoint, the request is synchronous and your application waits for the response before sending the next request. The responses therefore come back in the order you sent them, so maintaining order is a matter of how you manage your requests, regardless of how many instances are behind the endpoint.
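
For example, a minimal sketch building on the loop from your question (using the same create_minibatch helper and predictor you already have):

# Each predict() call blocks until its own response arrives, so the next
# request is never sent before the previous one has returned.
outputs = []
for batch in create_minibatch(inputs, batch_size=128):
    outputs.extend(predictor.predict(batch))

# Inputs and outputs line up one-to-one, so pairing them is a simple zip.
results = list(zip(inputs, outputs))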

Question: Is there a better way of serving this workload - I feel it falls somewhere in between the API and Batch paradigm?

Answer:

There are two main issues at play here:

  1. Cost
  2. Time

Depending on your business need, one of the above may be more important to you, or you may need to find a balance between the two.

SageMaker has the following inference options that might be useful for your case:

  1. Real-time inference: here you provision resources for the endpoint and it stays up and running, so when you need to do predictions you use it straight away with no wait. Cost is based on the time the endpoint is InService and the number and type of instances (see the pricing page).
  2. Batch inference (Batch Transform): here you send a large number of records that you want predictions for. It requires resources to be provisioned first, which can take a few minutes before the actual predictions happen, and cost is calculated based on the time it took to run the predictions. A minimal sketch follows below.
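
For example, a minimal Batch Transform sketch with the SageMaker Python SDK could look like this (the S3 paths, role and entry point below are placeholders, not your actual setup):

from sagemaker.pytorch import PyTorchModel

# Placeholder model packaging; adjust model_data, role and entry_point to your setup.
model = PyTorchModel(
    model_data="s3://my-bucket/models/text-classifier/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    framework_version="1.13.1",
    py_version="py39",
    entry_point="inference.py",
)

# Transient instances are provisioned per job; this is where the start-up minutes go.
transformer = model.transformer(
    instance_count=2,
    instance_type="ml.c5.xlarge",
    strategy="MultiRecord",  # micro-batch multiple records into each request
    output_path="s3://my-bucket/predictions/",
)

# One JSON record per line in the input file; results land in output_path when done.
transformer.transform(
    data="s3://my-bucket/inputs/records.jsonl",
    content_type="application/json",
    split_type="Line",
    wait=True,
)

The advantage is that instance_count can be raised to fan the records out over several instances, at the cost of the per-job provisioning time.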

Theoretically speaking, the time it takes to run the predictions in option 1 or 2 should be very similar. However, in option 1 you provision the endpoint in advance, so when you invoke it, it feels faster because your instance is already there and ready, and that readiness comes with extra cost.

Now, if that trade-off works for you, you can always provision the endpoint shortly before you start predicting and tear it down after. This way you get the benefit of both worlds.
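
A rough sketch of that pattern with the SageMaker Python SDK (again with placeholder names, and reusing the create_minibatch helper from your question):

from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Same placeholder model packaging as in the Batch Transform sketch above.
model = PyTorchModel(
    model_data="s3://my-bucket/models/text-classifier/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    framework_version="1.13.1",
    py_version="py39",
    entry_point="inference.py",
)

# Spin the endpoint up only when a job arrives...
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

try:
    outputs = []
    for batch in create_minibatch(inputs, batch_size=128):  # same client batching as in the question
        outputs.extend(predictor.predict(batch))
finally:
    # ...and tear it down as soon as the job is done, so you only pay while it runs.
    predictor.delete_endpoint()

Endpoint creation itself still takes a few minutes, so this mainly saves cost rather than latency compared with keeping the endpoint InService.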

However, there is currently no option where the resources stay persistent and you can just start predicting whenever you want, because of the cost and operational considerations in play here.

AWS
SUPPORT ENGINEER
Sam_E
answered a year ago

Thanks Sam, that's really useful! I think we're going to use Batch Transform. Are you aware of any documentation on best practices to reduce the provisioning time? It seems to be taking between 3 and 6 minutes (which is OK, but I feel it could be quicker). I'm using PyTorch and deploying a pre-trained model, so the image is '{account}.dkr.ecr.{region}.amazonaws.com/pytorch-inference:1.13.1-cpu-py39'.

Because (I think) it builds the container each time, and my model.tar.gz contains a requirements.txt file that downloads/installs Hugging Face Transformers among other heavy dependencies, deployment probably slows down. Is it possible to reuse the container rather than have it rebuilt each time a new batch job is submitted? Or to create an image based on pytorch-inference:1.13.1-cpu-py39 with the dependencies pre-installed (roughly what I sketch below)?
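
To make that second option concrete, here is roughly what I imagine pointing the model at a pre-built image would look like (the ECR URI below is just a placeholder for an image derived from pytorch-inference:1.13.1-cpu-py39 with Transformers already installed):

from sagemaker.pytorch import PyTorchModel

# Hypothetical custom image with transformers etc. baked in, so model.tar.gz
# no longer needs a requirements.txt install at container start-up.
custom_image = "{account}.dkr.ecr.{region}.amazonaws.com/pytorch-inference-transformers:1.13.1-cpu-py39"

model = PyTorchModel(
    image_uri=custom_image,  # overrides the default framework image
    model_data="s3://my-bucket/models/text-classifier/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    entry_point="inference.py",
)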

Dave
answered a year ago
