AWS SageMaker endpoints don't accept concurrent calls?


I have a PyTorch model deployed on an ml.p3.2xlarge endpoint, called through Lambda and API Gateway.

When multiple requests are sent to it, the logs show that it processes them sequentially; there is no concurrency. Shouldn't an endpoint be able to handle on the order of 200 calls concurrently?

Do we need to set something up?

Please let me know.

This is a sample concurrent call:

import threading

import requests

def send_request():
    # encoded_image is a base64-encoded image prepared elsewhere
    data = {"image": encoded_image}
    response = requests.post("https://idr263lxri.something", json=data)
    print(response.text)

threads = []
for _ in range(8):
    thread = threading.Thread(target=send_request)
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

I tried several variations of concurrent calling, and the requests were still handled sequentially every time.

It doesn't make sense that an endpoint would only serve one call at a time.

Are we missing something?

Thank you

1 Answer

Hi,

We use SageMaker model endpoints with multiple parallel queries all the time to increase our overall inference throughput: it works fine.

You need to share more of your setup to get more detailed support.

In the meantime, you can read this pair of blog posts to see what's possible: https://aws.amazon.com/blogs/machine-learning/scale-foundation-model-inference-to-hundreds-of-models-with-amazon-sagemaker-part-1/
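As an illustration, here is a minimal sketch of the kind of parallel client we use (the endpoint name and the payload are placeholders, not taken from your setup):

import json
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")

def invoke(payload):
    # invoke_endpoint is a blocking call; running several of them in a thread
    # pool sends the requests to the endpoint concurrently.
    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",          # placeholder endpoint name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return response["Body"].read()

payloads = [{"image": "..."} for _ in range(8)]   # placeholder payloads
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(invoke, payloads))

Whether those requests are actually processed in parallel on the instance then depends on how the serving container is configured, which is why the details of your deployment matter.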

Best,

Didier

answered 3 months ago
  • Hi, comment 1/2. Thanks for the reply. Here are more details about the deployment process.

    This is the inference.py:

    import json
    import logging
    import time

    import torch

    logger = logging.getLogger(__name__)

    def model_fn(model_dir, context=None):
        start_time = time.time()
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(device)
        model = custome_model_load()  # custom loading helper defined elsewhere
        model.to(device=device)
        return model

    def input_fn(request_body, request_content_type='application/json', context=None):
        # handle input
        ...

    def predict_fn(input_data, model, context=None):
        return model(input_data)

    def output_fn(predictions, content_type="application/json", context=None):
        return json.dumps({"mask": predictions})

  • Hi, comment 2/2. Thanks for the reply.

    Here are more details about the deployment process.

    This is the deployment code:

    import sagemaker
    from sagemaker import get_execution_role
    from sagemaker.pytorch import PyTorchModel

    sagemaker_session = sagemaker.Session()
    role = get_execution_role()

    pytorch_model = PyTorchModel(
        model_data='s3://' + sagemaker_session.default_bucket() + '/model30/model.tar.gz',
        role=role,
        entry_point='inference.py',
        framework_version='2.1',
        py_version='py310',
    )

    predictor = pytorch_model.deploy(
        instance_type='ml.p3.2xlarge',
        initial_instance_count=1,
    )

  • Hi, use this doc page to see how to implement your endpoint properly so that it accepts parallel inferences: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html (a minimal configuration sketch also follows after this comment thread)

  • Hi Didier:

    Thank you for sharing the doc. Upon review I didn't find a section that specifically addresses parallel inference. Can you please provide a more specific solution?

    Thank you for your help.
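Following up on the comment thread above: a likely reason for strictly sequential handling is the number of model-server workers in the serving container. The SageMaker PyTorch inference container typically defaults to one worker per GPU, so on a single-GPU ml.p3.2xlarge one worker serves all requests one after another. Below is a minimal, hypothetical variant of the deployment code from the comments that requests several workers; the worker count of 4 is a placeholder, and each worker loads its own copy of the model into GPU memory, so this only helps if the model fits several times over.

import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorchModel

sagemaker_session = sagemaker.Session()
role = get_execution_role()

pytorch_model = PyTorchModel(
    model_data='s3://' + sagemaker_session.default_bucket() + '/model30/model.tar.gz',
    role=role,
    entry_point='inference.py',
    framework_version='2.1',
    py_version='py310',
    # Assumption: model_server_workers sets SAGEMAKER_MODEL_SERVER_WORKERS in the
    # container, i.e. the number of worker processes handling /invocations.
    model_server_workers=4,
)

predictor = pytorch_model.deploy(
    instance_type='ml.p3.2xlarge',
    initial_instance_count=1,
)

Another way to increase concurrency is to scale out instead: deploy with initial_instance_count greater than 1, or attach auto scaling, so that each instance handles a share of the traffic.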
