SageMaker Inference Endpoint timeout value?


Hello,

I am trying to deploy this GitHub solution, but I do not have access to an ml.g5.12xlarge and am hoping to run it on an ml.g5.4xlarge instead. Based on the error I am getting (see the dump below, showing a 60 s timeout), I am wondering whether there is some sort of timeout variable I can set when I create the SageMaker endpoint to increase how long it waits for the model to respond to the query.

PS: I am pretty sure the issue is not related to this post.

Thank you,

---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/urllib3/connectionpool.py:467, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    463         except BaseException as e:
    464             # Remove the TypeError from the exception chain in
    465             # Python 3 (including for exceptions like SystemExit).
    466             # Otherwise it looks like a bug in the code.
--> 467             six.raise_from(e, None)
    468 except (SocketTimeout, BaseSSLError, SocketError) as e:


TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

ReadTimeoutError                          Traceback (most recent call last)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/httpsession.py:464, in URLLib3Session.send(self, request)
    463 request_target = self._get_request_target(request.url, proxy_url)
--> 464 urllib_response = conn.urlopen(
    465     method=request.method,
    466     url=request_target,
    467     body=request.body,
    468     headers=request.headers,
    469     retries=Retry(False),
    470     assert_same_host=False,
    471     preload_content=False,
    472     decode_content=False,
    473     chunked=self._chunked(request.headers),
    474 )
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/urllib3/connectionpool.py:358, in HTTPConnectionPool._raise_timeout(self, err, url, timeout_value)
    357 if isinstance(err, SocketTimeout):
--> 358     raise ReadTimeoutError(
    359         self, url, "Read timed out. (read timeout=%s)" % timeout_value
    360     )
    362 # See the above comment about EAGAIN in Python 3. In Python 2 we have
    363 # to specifically catch it and throw the timeout error

ReadTimeoutError: AWSHTTPSConnectionPool(host='runtime.sagemaker.us-east-1.amazonaws.com', port=443): 
Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/httpsession.py:501, in URLLib3Session.send(self, request)
    500 except URLLib3ReadTimeoutError as e:
--> 501     raise ReadTimeoutError(endpoint_url=request.url, error=e)
    502 except ProtocolError as e:

ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/aws-genai-mda-blog-flan-t5-xxl-endpoint-6ecf4020/invocations"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[14], line 20
      7 query = "How many covid cases are there in the state of NY"
---> 20 response = run_query(query)
     21 print("----------------------------------------------------------------------")
     22 print(f'SQL and response from user query {query}  \n  {response}')

Cell In[13], line 51, in run_query(query)
---> 51 channel, db = identify_channel(query)

1 Answer

Hi,

Please check out this page for the available timeout parameters to configure: https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-hosting.html
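
Note also that the 60 s value in your traceback matches the boto3 client's default read timeout, so the client may be giving up before the endpoint has answered. A minimal sketch of raising it, assuming you invoke the endpoint with boto3 (the endpoint name is copied from your traceback and the payload is only illustrative):

import json

import boto3
from botocore.config import Config

# Raise the client-side read timeout (botocore's default is 60 s) and
# disable retries so a slow request is not re-sent while the first
# invocation is still running.
cfg = Config(read_timeout=300, retries={"max_attempts": 0})
runtime = boto3.client("sagemaker-runtime", config=cfg)

response = runtime.invoke_endpoint(
    EndpointName="aws-genai-mda-blog-flan-t5-xxl-endpoint-6ecf4020",
    ContentType="application/json",
    Body=json.dumps({"inputs": "How many covid cases are there in the state of NY"}),
)
print(response["Body"].read().decode("utf-8"))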

Also, you may want to adapt the values of the environment variables SAGEMAKER_TS_RESPONSE_TIMEOUT and SAGEMAKER_MODEL_SERVER_TIMEOUT inside your model container (if they exist for this specific model) to increase the timeout value.
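
As a sketch of how those variables can be set, assuming the model is deployed with the SageMaker Python SDK (the role ARN and model artifact path below are placeholders, and whether a given container honors these variables depends on the specific DLC):

from sagemaker.huggingface import HuggingFaceModel

# Pass the serving-stack timeouts (in seconds) through the container
# environment when the model object is created.
model = HuggingFaceModel(
    role="arn:aws:iam::<account-id>:role/<execution-role>",  # placeholder
    model_data="s3://<bucket>/<prefix>/model.tar.gz",        # placeholder
    transformers_version="4.26.0",
    pytorch_version="1.13.1",
    py_version="py39",
    env={
        "SAGEMAKER_TS_RESPONSE_TIMEOUT": "600",
        "SAGEMAKER_MODEL_SERVER_TIMEOUT": "600",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
)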

Best,

Didier

answered 2 months ago
  • Thank you for the insights. I added values to the timeout variables you mentioned, but still got similar errors. I also tried a couple of different Hugging Face models and deep learning containers (DLCs) that are supposed to perform faster, but it seemed to make very little difference. Do you think using the smaller flan-t5-xl model is worth trying out?

    Just to summarize the models used and the errors I am getting:

    Model: Flan-T5-XXL
    DLC: pytorch-inference:1.12.0-gpu-py38
    Error: ReadTimeoutError: AWSHTTPSConnectionPool(host='runtime.sagemaker.us-east-1.amazonaws.com', port=443): Read timed out. (read timeout=60)

    Model: Flan-T5-XXL-FP16
    DLC: huggingface-pytorch-inference:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04
    Error (raised by inference endpoint): An error occurred (ModelError) when calling the InvokeEndpoint operation: "InternalServerException", {"message": "\"addmm_impl_cpu_\" not implemented for 'Half'"}

    Model: Flan-T5-XXL-BNB-INT8
    DLC: pytorch-inference:1.12.0-gpu-py38
    Error: ReadTimeoutError: AWSHTTPSConnectionPool(host='runtime.sagemaker.us-east-1.amazonaws.com', port=443): Read timed out. (read timeout=60)
