By default, does a SageMaker endpoint handle parallel requests?

When multiple concurrent InvokeEndpoint requests are sent to a deployed AWS SageMaker endpoint, how are they handled?

I have deployed an endpoint on an ml.p3.2xlarge instance. One job currently takes about 45 seconds to process. I sent 4 different InvokeEndpoint requests at the same time, and the CloudWatch logs show the jobs being processed serially, in the order the requests arrived. --I suspect there is some sort of queue internally within the server model itself.--

I am aware of automatic scaling as described here: https://aws.amazon.com/blogs/machine-learning/load-test-and-optimize-an-amazon-sagemaker-endpoint-using-automatic-scaling/ but my question is: by default, does AWS SageMaker not handle concurrent requests at the same time?

UPDATE
Upon further investigation and testing, here is some additional information.
I have deployed an ml.m4.xlarge instance whose transform function simply sleeps for 45 seconds. It looks roughly like this:

import json
import time

def transform_fn(model, request_body, content_type, accept_type):
    request_body_dict = json.loads(request_body)
    time.sleep(45)  # simulate a 45-second job
    ...

Furthermore, I have set the server timeout to 420 seconds, like so:

from sagemaker.mxnet import MXNetModel

sagemaker_model = MXNetModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/yolo_object_person_detector.tar.gz',
                             role=role,
                             entry_point='load_testing_entrypoint.py',
                             py_version='py3',
                             framework_version='1.4.1',
                             sagemaker_session=sagemaker_session,
                             env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '420'})

predictor = sagemaker_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    endpoint_name='load-testing')

I sent 9 consecutive requests and monitored how they were executed. I found that the requests are not handled in any specific order.

A few questions I have from this experiment:

  1. Does AWS SageMaker not process requests concurrently? That is, I would expect the server to be able to handle two requests at the same time.
  2. On the client side, how is the case where the server is busy handled? I notice that the client internally retries about 3 times, once every 60 seconds, if the request has not been handled.
  3. Within each 60-second window, how does the client code call the endpoint? Does it keep calling after 1, 2, 4, 6, 8 seconds?

Here is the client-side code:

import boto3

sagemaker_client = boto3.client('sagemaker-runtime')
response = sagemaker_client.invoke_endpoint(EndpointName='load-testing', Body=request_body)
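
To make such an experiment measurable, the requests can be fired from a thread pool and timed. The following is a minimal sketch, not part of the original test: `probe_concurrency` and `serialization_ratio` are hypothetical helper names, and `invoke` is any callable that sends one request. A ratio near 1 suggests the requests overlapped; a ratio near the number of requests suggests they were handled one at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def probe_concurrency(invoke, payloads, max_workers=None):
    """Send all payloads at once; return per-call (start, end) offsets and total wall time.

    `invoke` is a hypothetical callable that sends one request and blocks
    until the response arrives.
    """
    origin = time.monotonic()

    def timed_call(payload):
        began = time.monotonic() - origin
        invoke(payload)
        return began, time.monotonic() - origin

    # One thread per payload by default, so every request is in flight together.
    workers = max_workers or len(payloads)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        timings = list(pool.map(timed_call, payloads))
    return timings, time.monotonic() - origin


def serialization_ratio(timings, wall):
    """Wall time divided by the shortest single call.

    Roughly 1 when calls overlap, roughly len(timings) when the server
    works through them one by one.
    """
    shortest = min(done - began for began, done in timings)
    return wall / max(shortest, 1e-9)
```

For the endpoint above, `invoke` could be `lambda body: sagemaker_client.invoke_endpoint(EndpointName='load-testing', Body=body)`. The same probe can be pointed at a locally running container.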

Edited by: ptanugraha on Nov 14, 2019 10:44 AM

asked 4 years ago, 2989 views
1 Answer

Hello,

Thanks for trying SageMaker, and our apologies for the late response.

  1. Does AWS SageMaker not process requests concurrently? That is, I would expect the server to be able to handle two requests at the same time.
    Answer: SageMaker does process requests concurrently. It forwards requests to the model container as they arrive and does not enqueue them. However, throttling is in place and can kick in when more requests arrive than the endpoint can handle; in that case you get an error response immediately. It is possible here that your model itself is processing the requests sequentially. I suggest testing your model container locally with concurrent requests.

  2. On the client side, how is the case where the server is busy handled? I notice that the client internally retries about 3 times, once every 60 seconds, if the request has not been handled.

  3. Within each 60-second window, how does the client code call the endpoint? Does it keep calling after 1, 2, 4, 6, 8 seconds?
    Answer: For these, refer to the AWS SDK client configuration:
    https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html
    https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html
    To answer your question: the API call waits for the response. If it gets an exception or a timeout, it retries according to the retry policy set in the SDK client configuration.
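
For questions 2 and 3, the timeout and retry policy can be set explicitly on the boto3 client. The sketch below is an illustration only: `runtime_client_kwargs` and `make_runtime_client` are hypothetical helper names, and the default values are assumptions, not recommendations from this thread. botocore's default read timeout is 60 seconds, which likely explains the 60-second window seen in the question. The InvokeEndpoint documentation also states that model containers must respond within 60 seconds, so raising the client timeout alone will not allow longer jobs.

```python
def runtime_client_kwargs(read_timeout=70, connect_timeout=10, max_attempts=1):
    """Build keyword arguments for botocore.config.Config.

    total_max_attempts counts the initial call, so 1 means "no retries".
    The defaults here are illustrative assumptions.
    """
    return {
        "read_timeout": read_timeout,        # seconds to wait for the response
        "connect_timeout": connect_timeout,  # seconds to wait for the connection
        "retries": {"total_max_attempts": max_attempts, "mode": "standard"},
    }


def make_runtime_client(**overrides):
    """Create a sagemaker-runtime client with explicit timeout and retry settings."""
    # Imported here so the settings helper above works without boto3 installed.
    import boto3
    from botocore.config import Config

    config = Config(**runtime_client_kwargs(**overrides))
    return boto3.client("sagemaker-runtime", config=config)
```

With retries disabled (`make_runtime_client(max_attempts=1)`), a slow or throttled request surfaces as a single exception instead of being silently resent, which makes a load test easier to interpret.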

Thanks

Edited by: harishataws on May 21, 2020 12:01 PM

AWS
answered 4 years ago
