AWS SageMaker endpoints don't accept concurrent calls?


I have a PyTorch model deployed on an ml.p3.2xlarge instance, connected to Lambda and API Gateway.

When multiple requests are sent to it, the logs show that it processes them sequentially; there is no concurrency. But shouldn't endpoints be able to handle on the order of 200 calls concurrently?

Do we need to set something up?

Please let me know.

This is a sample concurrent call:

import threading
import requests

def send_request():
    data = {"image": encoded_image}
    response = requests.post("https://idr263lxri.something", json=data)
    print(response.text)

threads = []
for _ in range(8):
    thread = threading.Thread(target=send_request)
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

I tried several variations of concurrent calling, and the requests are still handled sequentially.

It doesn't make sense that an endpoint can only serve one call at a time...

Are we missing something?

Thank you

1 Answer

Hi,

We constantly use SageMaker model endpoints with multiple parallel queries to increase our global inference throughput: it works fine.
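For illustration, here is a minimal sketch of one way to send parallel queries to an endpoint, using boto3 and a thread pool (the endpoint name is hypothetical, and the payload shape depends on your inference.py):

# Minimal sketch: invoke a SageMaker endpoint from several threads at once.
# Assumes boto3 credentials are configured; "my-endpoint" is a hypothetical name.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")

def invoke(payload):
    # Each call goes through the SageMaker runtime API rather than API Gateway/Lambda,
    # which helps isolate where any serialization of requests is happening.
    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return response["Body"].read()

payloads = [{"image": "..."} for _ in range(8)]  # placeholder payloads

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(invoke, payloads))

If calls made this way are still handled one at a time, the serialization is happening on the endpoint side rather than in the client or the Lambda/API Gateway layer.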

You will need to share more of your setup to get more detailed support.

In the meantime, you can read this pair of blog posts to see what's possible: https://aws.amazon.com/blogs/machine-learning/scale-foundation-model-inference-to-hundreds-of-models-with-amazon-sagemaker-part-1/

Best,

Didier

AWS Expert · answered 3 months ago
  • Hi, comment 1/2 Thanks for the reply. Here are more details about the deployment process:

    This is the inference.py:

    import json
    import time

    import torch

    def model_fn(model_dir, context=None):
        start_time = time.time()
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(device)  # logger is configured at module level
        model = custome_model_load()  # custom loading helper defined elsewhere
        model.to(device=device)
        return model

    def input_fn(request_body, request_content_type='application/json', context=None):
        # handle input (deserialize the JSON request body)
        ...

    def predict_fn(input_data, model, context=None):
        return model(input_data)

    def output_fn(predictions, content_type="application/json", context=None):
        return json.dumps({"mask": predictions})

  • Hi, comment 2/2 Thanks for the reply.

    Here are more details about the deployment process:

    This is the deployment code:

    from sagemaker import get_execution_role
    from sagemaker.pytorch import PyTorchModel

    role = get_execution_role()

    pytorch_model = PyTorchModel(
        model_data='s3://' + sagemaker_session.default_bucket() + '/model30/model.tar.gz',
        role=role,
        entry_point='inference.py',
        framework_version='2.1',
        py_version='py310'
    )

    predictor = pytorch_model.deploy(instance_type='ml.p3.2xlarge', initial_instance_count=1)

  • Hi, use this doc page to see how to implement your endpoint properly so that it accepts parallel inferences: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html (a minimal sketch of one relevant setting follows this comment thread)

  • Hi Didier:

    Thank you for sharing the doc. Upon review, I didn't find a section that specifically addresses parallel inference. Could you please provide a more specific solution?

    Thank you for your help.
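Following up on the doc pointer above, one setting that commonly matters for parallel request handling is the number of model server workers: with a single worker, the container processes one /invocations request at a time. The sketch below is an assumption based on the SageMaker PyTorch inference toolkit, which reads the SAGEMAKER_MODEL_SERVER_WORKERS environment variable; verify the variable and its behavior against the toolkit documentation before relying on it.

# Minimal sketch (assumption): ask the model server to start several workers so the
# endpoint can process multiple /invocations requests in parallel.
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorchModel

role = get_execution_role()

pytorch_model = PyTorchModel(
    model_data='s3://' + sagemaker_session.default_bucket() + '/model30/model.tar.gz',
    role=role,
    entry_point='inference.py',
    framework_version='2.1',
    py_version='py310',
    env={'SAGEMAKER_MODEL_SERVER_WORKERS': '4'},  # assumption: 4 workers
)

predictor = pytorch_model.deploy(instance_type='ml.p3.2xlarge', initial_instance_count=1)

Note that each worker loads its own copy of the model, so the single 16 GB V100 GPU on an ml.p3.2xlarge limits how many workers can fit in memory.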
