AWS SageMaker endpoints don't accept concurrent calls?


I have a PyTorch model deployed on an ml.p3.2xlarge endpoint, connected to Lambda and API Gateway.
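
For context, the Lambda in front of the endpoint is essentially a thin pass-through to the SageMaker runtime, roughly along these lines (a simplified sketch, not the actual handler; the endpoint name is a placeholder):

    import json

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    def lambda_handler(event, context):
        # Forward the JSON body received from API Gateway to the SageMaker endpoint.
        body = json.loads(event["body"])              # e.g. {"image": "<base64 string>"}
        response = runtime.invoke_endpoint(
            EndpointName="my-endpoint",               # placeholder endpoint name
            ContentType="application/json",
            Body=json.dumps(body),
        )
        return {
            "statusCode": 200,
            "body": response["Body"].read().decode("utf-8"),
        }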

When multiple requests are sent to it, the logs show that it processes them sequentially; there is no concurrency. Shouldn't an endpoint be able to handle on the order of 200 calls concurrently?

Do we need to set something up?

Please let me know.

This is a sample concurrent call:

    import threading

    import requests

    def send_request():
        data = {"image": encoded_image}   # encoded_image is prepared earlier
        response = requests.post("https://idr263lxri.something", json=data)
        print(response.text)

    threads = []
    for _ in range(8):
        thread = threading.Thread(target=send_request)
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

I tried several variations of concurrent calling, and the requests were handled sequentially every time.
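
A timed variant of the same test (same placeholder URL, encoded_image as above; a sketch, not the exact script I ran) makes it easier to see whether the calls overlap or queue up behind each other:

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "https://idr263lxri.something"   # placeholder endpoint URL

    def send_request(i):
        start = time.time()
        response = requests.post(URL, json={"image": encoded_image})
        return i, response.status_code, time.time() - start

    with ThreadPoolExecutor(max_workers=8) as pool:
        for i, status, elapsed in pool.map(send_request, range(8)):
            print(f"request {i}: HTTP {status} in {elapsed:.2f}s")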

It doesn't make sense that an endpoint would only serve one call at a time...

Are we missing something?

Thank you

1 Answer

Hi,

We use SageMaker model endpoints all the time with multiple parallel queries to increase our overall inference throughput: it works fine.

You will need to share more of your setup to get more detailed support.

In the meantime, you can read this pair of blog posts to see what's possible: https://aws.amazon.com/blogs/machine-learning/scale-foundation-model-inference-to-hundreds-of-models-with-amazon-sagemaker-part-1/
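
As a quick illustration (a sketch, not your exact setup; the endpoint name and payload are placeholders), parallel client-side invocations with boto3 look roughly like this:

    import json
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    smr = boto3.client("sagemaker-runtime")

    def invoke(payload):
        # Each thread sends its own request to the same endpoint.
        response = smr.invoke_endpoint(
            EndpointName="my-endpoint",            # placeholder endpoint name
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        return response["Body"].read()

    payloads = [{"image": "<base64 string>"} for _ in range(8)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(invoke, payloads))

Whether those requests are then actually processed in parallel on the instance depends on how the model server inside your container is configured.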

Best,

Didier

AWS EXPERT
Answered 3 months ago
  • Hi, comment 1/2. Thanks for the reply. Here are more details about the deployment process.

    This is the inference.py:

        import json
        import logging
        import time

        import torch

        logger = logging.getLogger(__name__)

        def model_fn(model_dir, context=None):
            start_time = time.time()
            device = "cuda" if torch.cuda.is_available() else "cpu"
            logger.info(device)
            model = custome_model_load()   # user-defined model loader
            model.to(device=device)
            return model

        def input_fn(request_body, request_content_type='application/json', context=None):
            ...  # handle input

        def predict_fn(input_data, model, context=None):
            return model(input_data)

        def output_fn(predictions, content_type="application/json", context=None):
            return json.dumps({"mask": predictions})

  • Hi, comment 2/2. Thanks for the reply.

    This is the deployment code:

        from sagemaker import get_execution_role
        from sagemaker.pytorch import PyTorchModel

        role = get_execution_role()

        # sagemaker_session is created earlier in the notebook
        pytorch_model = PyTorchModel(
            model_data='s3://' + sagemaker_session.default_bucket() + '/model30/model.tar.gz',
            role=role,
            entry_point='inference.py',
            framework_version='2.1',
            py_version='py310',
        )

        predictor = pytorch_model.deploy(
            instance_type='ml.p3.2xlarge',
            initial_instance_count=1,
        )

  • Hi, use this doc page to see how to implement your endpoint so that it accepts parallel inferences (see also the configuration sketch after these comments): https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html

  • Hi Didier:

    Thank you for sharing the doc. Upon review, I didn't find a section that specifically addresses parallel inference. Could you please point me to a more specific solution?

    Thank you for your help.
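
A configuration sketch, as referenced in the comments above. For the SageMaker PyTorch serving container, one commonly used knob is the number of model-server workers, which can be set through the SAGEMAKER_MODEL_SERVER_WORKERS environment variable. Whether this is the right fix for this particular endpoint is an assumption, not something confirmed in this thread; it is shown only to illustrate the kind of setting the linked documentation points at:

    from sagemaker import get_execution_role
    from sagemaker.pytorch import PyTorchModel

    role = get_execution_role()

    pytorch_model = PyTorchModel(
        model_data='s3://<your-bucket>/model30/model.tar.gz',   # placeholder path
        role=role,
        entry_point='inference.py',
        framework_version='2.1',
        py_version='py310',
        # Assumption: ask the container for two model-server workers so that
        # two requests can be processed at the same time. Each worker loads
        # its own copy of the model, so GPU memory must allow it.
        env={'SAGEMAKER_MODEL_SERVER_WORKERS': '2'},
    )

    predictor = pytorch_model.deploy(
        instance_type='ml.p3.2xlarge',
        initial_instance_count=1,
    )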
