
SageMaker inference toolkit reruns the same task multiple times for one request

0

I followed AWS tutorials to deploy a model on SageMaker, but when the deployed server takes several minutes to process a request, it runs the same processing multiple times for a single invocation. I reported the issue here (https://github.com/aws/sagemaker-huggingface-inference-toolkit/issues/133). I also saw it in https://github.com/aws/amazon-sagemaker-examples/issues/1073 and in this question, and believe the problem lies with the SageMaker inference toolkit.

Can you show me how to handle it? It's blocking my pipeline.

2 Answers
1
Accepted Answer

Since you're seeing predict_fn itself getting called multiple times (not just model_fn, which would be normal), it looks like something in your serving stack is triggering a retry of the inference request.

It's been a few versions since I've checked if this is up-to-date, but for HuggingFaceModel-based endpoints I'd suggest setting the MMS_DEFAULT_RESPONSE_TIMEOUT environment variable on your Model as used in this old example notebook.

This controls the request-processing timeout of the serving stack inside the container itself (in seconds). There are similar configurable limits to be aware of, like the request and response size limits MMS_MAX_REQUEST_SIZE and MMS_MAX_RESPONSE_SIZE, which could trip up your server even if SageMaker itself is okay with the payload and duration. These parameters come from the AWSLabs Multi Model Server, which the HF container (I believe still) uses as the basis for serving.
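As a rough sketch of what I mean, you can pass these variables through the `env` argument when you create the Model. The timeout and size values below are illustrative assumptions, not recommendations, and the commented-out `HuggingFaceModel` arguments (model path, framework versions) are placeholders you'd replace with your own:

```python
# Sketch: raising the MMS limits via the Model's `env` dict.
# Values are illustrative - tune them for your own payloads.
mms_env = {
    "MMS_DEFAULT_RESPONSE_TIMEOUT": "600",            # seconds MMS waits for a worker response
    "MMS_MAX_REQUEST_SIZE": str(100 * 1024 * 1024),   # max request payload, in bytes
    "MMS_MAX_RESPONSE_SIZE": str(100 * 1024 * 1024),  # max response payload, in bytes
}

# from sagemaker.huggingface import HuggingFaceModel
#
# model = HuggingFaceModel(
#     model_data="s3://my-bucket/model.tar.gz",  # hypothetical path
#     role=role,
#     transformers_version="4.26",
#     pytorch_version="1.13",
#     py_version="py39",
#     env=mms_env,  # <- the variables above land in the container environment
# )
```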

However, do note that for real-time inference (as documented here) there is a hard limit on the SageMaker service side: your model must respond within 60 seconds.

If you need longer, you'll probably want to deploy your model to an Asynchronous Inference endpoint instead - but you'll still need to set the above environment variable(s) (as we did in the linked notebook) to take advantage of the longer 60-minute timeout without MMS restarting the inference.
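A minimal sketch of that combination - async deployment plus a raised MMS timeout - might look like the following. The instance type, S3 paths, and the one-hour timeout value are assumptions for illustration:

```python
# Sketch: deploy as an Asynchronous Inference endpoint so long-running
# requests aren't cut off at the real-time 60-second service limit.
async_env = {
    # Keep MMS from timing out (and re-running) the worker before the job finishes
    "MMS_DEFAULT_RESPONSE_TIMEOUT": "3600",
}

# from sagemaker.huggingface import HuggingFaceModel
# from sagemaker.async_inference import AsyncInferenceConfig
#
# model = HuggingFaceModel(..., env=async_env)  # same Model args as before
# predictor = model.deploy(
#     initial_instance_count=1,
#     instance_type="ml.g4dn.xlarge",  # hypothetical instance type
#     async_inference_config=AsyncInferenceConfig(
#         output_path="s3://my-bucket/async-results/",  # where results are written
#     ),
# )
#
# # Async invocations return immediately; the result lands in S3 at output_path.
# response = predictor.predict_async(input_path="s3://my-bucket/inputs/request.json")
```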

AWS
EXPERT
answered a year ago
  • Thanks, Alex T, for your answer. I tried all the configurations you suggested, but none worked. I was wondering why the SageMaker inference toolkit maintainers haven't fixed this supposed issue or explained what happens in their script. They're not active on the repositories at all, which discourages me from trying more with AWS SageMaker.

  • Just to clarify (as mentioned a bit further down the answer but maybe not given enough priority), if you're trying to make a real-time inference endpoint process a request that takes longer than 60 seconds, then the env vars won't help because this time-out is enforced by the service... But if you're able to deploy an Async endpoint with these variables, then you should have more luck? Did you try with async also?

  • I deployed an async endpoint and it worked! It's a little bit less convenient for use than the realtime endpoint though. Thanks a lot for your suggestions.

-1

I apologize for the inconvenience you're experiencing with your SageMaker deployment. The issue you're describing, where the same task is being rerun multiple times for a single request, is indeed a known problem that has been reported by several users.

To address this issue, there are a few approaches you can consider:

  1. Implement request deduplication: You can add a unique identifier to each request and check if that identifier has been processed before. If it has, return the cached result instead of reprocessing.

  2. Use sticky session routing: Amazon SageMaker now supports sticky session routing, which ensures that all requests for the same session are routed to the same instance. This can help prevent multiple processing of the same request. To use this feature, you need to create a session ID with your first request and include it in subsequent requests.

  3. Optimize your model for faster inference: If possible, try to reduce the processing time of your model. This might involve model compression, quantization, or other optimization techniques.

  4. Increase the timeout settings: While this doesn't solve the root cause, increasing the timeout for your endpoint might allow the request to complete before a retry is triggered.

  5. Use SageMaker inference pipelines: If your processing involves multiple steps, consider using SageMaker inference pipelines. This allows you to chain multiple containers that process requests sequentially, which might help manage complex workflows more effectively.

  6. Monitor and log extensively: Implement detailed logging in your inference code to track exactly what's happening during each invocation. This can help identify where and why the repetitions are occurring.

  7. Consider using a custom inference container: If the issue persists and is indeed related to the SageMaker Hugging Face Inference Toolkit, you might want to consider creating a custom inference container that gives you more control over the inference process.
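As a minimal sketch of idea 1 (request deduplication), you could key a cache on a hash of the request body inside your inference handler, so a retried delivery of the same request returns the cached result instead of recomputing. `run_model` below is a hypothetical stand-in for your actual inference call, and the in-memory dict is the simplest possible cache (a real deployment might use something with eviction or TTL):

```python
import hashlib
import json

# In-memory cache: request-body hash -> previously computed result.
_cache = {}

def request_key(payload: dict) -> str:
    # Stable hash of the request body (sorted keys for determinism).
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(body).hexdigest()

def predict_with_dedup(payload: dict, run_model) -> dict:
    key = request_key(payload)
    if key in _cache:
        # Duplicate delivery of the same request: reuse the earlier result.
        return _cache[key]
    result = run_model(payload)  # the expensive inference call
    _cache[key] = result
    return result
```

With this wrapper, a second invocation carrying an identical payload returns immediately from the cache rather than rerunning the model.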

While these approaches may help mitigate the issue, it's important to note that this seems to be a recurring problem that AWS is aware of. You may want to continue following the GitHub issues you've mentioned and consider reaching out to AWS support directly for the most up-to-date guidance on resolving this issue.
Sources
Announcing sticky session routing for Amazon SageMaker Inference - AWS
Inference pipelines in Amazon SageMaker AI - Amazon SageMaker AI

answered a year ago
