1 Answer
Hi,
We use SageMaker model endpoints constantly, with multiple parallel queries running at all times to increase our global inference throughput: it works fine.
You will need to share more details about your setup to get more specific help.
In the meantime, you can read this pair of blog posts to see what's possible: https://aws.amazon.com/blogs/machine-learning/scale-foundation-model-inference-to-hundreds-of-models-with-amazon-sagemaker-part-1/
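For illustration, here is a minimal client-side sketch of what "multiple parallel queries" can look like with boto3; the endpoint name and payloads below are placeholders, not taken from your deployment:

import json
from concurrent.futures import ThreadPoolExecutor

import boto3

ENDPOINT_NAME = "my-pytorch-endpoint"  # placeholder: use your endpoint name
payloads = [{"image_ref": f"item-{i}"} for i in range(8)]  # placeholder payloads

runtime = boto3.client("sagemaker-runtime")

def invoke(payload):
    # Each call is an independent HTTPS request; running several at once
    # is what raises endpoint throughput, provided the container and the
    # instance count behind it can absorb the concurrency.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(invoke, payloads))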
Best,
Didier
Hi, comment 1/2. Thanks for the reply. Here are more details about the deployment process.
This is the inference.py:

import json
import logging
import time

import torch

logger = logging.getLogger(__name__)

def model_fn(model_dir, context=None):
    start_time = time.time()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    logger.info(device)
    model = custome_model_load()  # our custom model-loading helper
    model.to(device=device)
    return model

def input_fn(request_body, request_content_type='application/json', context=None):
    # handle input
    ...

def predict_fn(input_data, model, context=None):
    return model(input_data)

def output_fn(predictions, content_type="application/json", context=None):
    return json.dumps({"mask": predictions})
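For context, the serving container calls these handlers in sequence for each incoming request, roughly like the simplified sketch below (the real dispatch lives inside the SageMaker PyTorch inference toolkit, so treat this only as an illustration of the flow):

def handle_request(request_body, content_type, accept, model):
    # Deserialize, predict, then serialize, one request at a time per worker.
    data = input_fn(request_body, content_type)
    predictions = predict_fn(data, model)
    return output_fn(predictions, accept)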
Hi, comment 2/2. Thanks for the reply.
Here are more details about the deployment process:
This is the deployment code:

from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorchModel

role = get_execution_role()

pytorch_model = PyTorchModel(
    model_data='s3://' + sagemaker_session.default_bucket() + '/model30/model.tar.gz',
    role=role,
    entry_point='inference.py',
    framework_version='2.1',
    py_version='py310',
)

predictor = pytorch_model.deploy(
    instance_type='ml.p3.2xlarge',
    initial_instance_count=1,
)
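If the endpoint ends up serializing requests, one knob that often matters is the number of model-server workers inside the container; to my knowledge the SageMaker inference toolkit reads the SAGEMAKER_MODEL_SERVER_WORKERS environment variable for this, but please verify that it applies to your framework_version. A hedged variant of the same deployment:

# Sketch only: same model, with assumed concurrency knobs called out.
pytorch_model = PyTorchModel(
    model_data='s3://' + sagemaker_session.default_bucket() + '/model30/model.tar.gz',
    role=role,
    entry_point='inference.py',
    framework_version='2.1',
    py_version='py310',
    # Assumed knob: number of worker processes serving requests in parallel.
    # On a single-GPU ml.p3.2xlarge, all workers share the one GPU.
    env={'SAGEMAKER_MODEL_SERVER_WORKERS': '2'},
)

# More instances behind the endpoint also raise parallel capacity.
predictor = pytorch_model.deploy(
    instance_type='ml.p3.2xlarge',
    initial_instance_count=2,
)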
Hi, use this doc page to see how to implement your endpoint so that it accepts parallel inferences: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html
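The part of that page that matters here is the container contract: the serving container exposes /ping and /invocations on port 8080, and parallelism comes from the web server handling concurrent POSTs to /invocations. The prebuilt PyTorch container already implements this for you; the sketch below, assuming Flask, only illustrates the contract the doc describes:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # Health check: SageMaker expects HTTP 200 once the model is ready.
    return "", 200

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = request.get_json()
    # Run the model here. A production server (e.g. gunicorn with several
    # workers) in front of this app is what actually processes requests
    # in parallel.
    result = {"mask": []}  # placeholder response
    return jsonify(result)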
Hi Didier:
Thank you for sharing the doc. Upon review, I didn't find a section that specifically addresses parallel inference. Could you please provide a more specific solution?
Thank you for your help.