Which service is suitable for me?


I have a project for which I would like to serve inference requests. For this I need an API, such as AWS Lambda or a SageMaker endpoint, so that the customer can send their requests there.

The inference performed in the AWS Cloud requires significant processing power, so I need a GPU. For example, running the inference on my PC's GPU takes a few seconds, but on my CPU it takes 2 minutes. I also need at least 6 MB of VRAM/RAM, though more would be better, because I process images and could then handle higher-quality ones.

Now I have looked at some options on AWS. Since AWS Lambda does not support GPUs, I looked at SageMaker inference, which offers real-time and asynchronous endpoints. Setting up a real-time endpoint covers my requirements, but then the instance runs all the time. Especially at the beginning I will only have a few requests per day, so it doesn't make sense for me to pay for 24 hours. I would need something that shuts down and quickly boots up again as soon as requests arrive.

That's why I tried to set up an asynchronous endpoint, because it can be scaled down and started up again. However, now I get an error message saying that my request has an incorrect inference type, and I don't understand how asynchronous requests work. I send my requests from client-side JavaScript to API Gateway, which forwards them to the SageMaker endpoint. Regarding asynchronous requests, I read that you have to use event-based requests, and that the response is stored in S3 instead of being sent directly to the user, as visualized here: https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html

I need a solution that sends the response directly to the user within a maximum of 10 seconds (for the first request, until the instance has booted, it may take a little longer, but no longer than a minute). Is this even possible with asynchronous endpoints? I'd rather ask here before I delve into this topic only to find that this option isn't suitable either.

On the other hand, I read about AWS Inferentia. Would this be an option for my use case?

I would be very happy about your answers.

Best regards Paul

asked 7 months ago · 710 views
1 Answer

Hi, thanks for reaching out! Let me try to help you with this.

Inferentia[1] is the recommended instance family for inference on AWS, and you can use it with SageMaker and model auto scaling[2]. But as you said, you'd need to have at least one instance running at all times, unless you manage the endpoint lifecycle yourself (deleting and recreating it), which would take more than 10 seconds to bring the endpoint back when you need it (more likely a few minutes).

If you use API Gateway + Lambda to call the async SageMaker endpoint, it can also take minutes for the job to complete, and you would need to send the results back to the customer through another channel, so that's not what you are looking for.
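For reference, a minimal sketch of the async flow in Python with boto3 (the endpoint name and bucket are placeholders, not from your setup): the request payload must already be in S3, the endpoint is invoked with that S3 location, and the caller then polls the returned `OutputLocation` until SageMaker writes the result object.

```python
import time

def build_async_request(endpoint_name, input_s3_uri,
                        content_type="application/json"):
    """Build the parameters for invoke_endpoint_async.

    Async endpoints read their input from S3, not from the request body --
    that is likely why a real-time-style request fails validation.
    """
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,
        "ContentType": content_type,
    }

def wait_for_result(s3_client, output_s3_uri, timeout_s=300, poll_s=5):
    """Poll S3 until SageMaker writes the result object (or time out)."""
    bucket, key = output_s3_uri.removeprefix("s3://").split("/", 1)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            obj = s3_client.get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            time.sleep(poll_s)
    raise TimeoutError(f"No result at {output_s3_uri} after {timeout_s}s")

def submit_and_wait(endpoint_name, input_s3_uri):
    """End-to-end: submit the async request, then block until the result
    lands in S3. Placeholder names; requires AWS credentials to run."""
    import boto3  # lazy import: only needed when actually calling AWS
    runtime = boto3.client("sagemaker-runtime")
    resp = runtime.invoke_endpoint_async(
        **build_async_request(endpoint_name, input_s3_uri))
    return wait_for_result(boto3.client("s3"), resp["OutputLocation"])
```

As you can see, even in the best case the client is polling S3, which is why the response cannot be pushed straight back through API Gateway within a synchronous request/response cycle.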

So I see the following options based on your use case:

  1. If you have a custom model (and really need a custom model), you can go with SageMaker + Inferentia and use the most cost-effective instance size with auto scaling to scale out if needed (but you'd need at least one instance running at all times).

  2. If you can use a foundation model (which can be fine-tuned), have a look at Amazon Bedrock[3], a managed service where you pay as you go per inference/API request and which supports several foundation models.
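To illustrate option 1, here is a sketch of the Application Auto Scaling setup[2] for a real-time endpoint variant, using target tracking on invocations per instance. The endpoint and variant names are placeholders; the target value is just an example and should come from your own load testing.

```python
def scaling_policy_config(endpoint_name, variant_name,
                          min_instances=1, max_instances=4,
                          target_invocations=10.0):
    """Build the Application Auto Scaling parameters for a SageMaker
    real-time endpoint variant (target tracking on invocations/instance)."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    return {
        "target": {
            "ServiceNamespace": "sagemaker",
            "ResourceId": resource_id,
            "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
            "MinCapacity": min_instances,  # real-time endpoints need >= 1
            "MaxCapacity": max_instances,
        },
        "policy": {
            "PolicyName": "invocations-target-tracking",
            "ServiceNamespace": "sagemaker",
            "ResourceId": resource_id,
            "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
            "PolicyType": "TargetTrackingScaling",
            "TargetTrackingScalingPolicyConfiguration": {
                "TargetValue": target_invocations,
                "PredefinedMetricSpecification": {
                    "PredefinedMetricType":
                        "SageMakerVariantInvocationsPerInstance"
                },
            },
        },
    }

def apply_scaling(cfg):
    """Register the target and attach the policy. Requires AWS credentials."""
    import boto3  # lazy import: only needed when actually calling AWS
    aas = boto3.client("application-autoscaling")
    aas.register_scalable_target(**cfg["target"])
    aas.put_scaling_policy(**cfg["policy"])
```

Note that `MinCapacity` stays at 1 here: real-time endpoints cannot scale to zero, which is the constraint from option 1 above.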

I hope this helps.

[1] - https://aws.amazon.com/machine-learning/inferentia/

[2] - https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html

[3] - https://aws.amazon.com/bedrock/

Steve T
answered 7 months ago
  • Hi Steve, thank you very much for your answer!

    It's sobering to hear that the approach I imagined is not possible.

    Today I read about warm pools, where instances can be kept pre-initialized in a stopped state so that they start faster. Would this option be conceivable for my use case, or is it not suitable for asynchronous endpoints?

    The most unfavorable thing is that I really don't have many requests in the beginning, so it is simply not worth hosting a GPU the whole time. A serverless solution would be nice, but I am dependent on the GPU.

    Best regards
