Using SageMaker Triton with an Async Endpoint and Binary Data


I've built a Triton container and I'd like to deploy it as an Async Endpoint that's invoked nightly. I have it working with auto scaling, and I can invoke it fine using application/json.

It's a lot slower than using binary data, though. I can create the request as follows:

import numpy as np
import tritonclient.http

text = tritonclient.http.InferInput('text', [len(test_data)], "BYTES")
text.set_data_from_numpy(np.array(test_data, dtype=object).reshape(text.shape()), binary_data=True)

labels = tritonclient.http.InferRequestedOutput('labels',  binary_data=True)
scores = tritonclient.http.InferRequestedOutput('scores',  binary_data=True)

# Need to create body then use sagemaker client to send rather than tritonclient directly
request_body, header_length = tritonclient.http.InferenceServerClient.generate_request_body(
        inputs=[text], outputs=[labels, scores]
)

with open("examples/request.bin","wb") as f:
    f.write(request_body)

I can copy this to S3, invoke the endpoint, and get the response back no problem:

response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName=endpoint_name, 
    InputLocation="s3://data-science.cimenviro.com/models/triton-serve/input/request.bin",
    ContentType=f'application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}')

output_location = response['OutputLocation']
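Once the job finishes, I pull the result from the OutputLocation. A minimal sketch of how I do that (the bucket and key below are placeholders, and the get_object call is commented out since it needs live credentials):

```python
import urllib.parse

def split_s3_uri(uri: str):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urllib.parse.urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri("s3://my-bucket/async-output/response.out")
# Once the async invocation has completed:
#   body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```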

The issue is that in order to parse the response, I need the json-header-size from the response's ContentType, but because SageMaker invokes the endpoint asynchronously, it's not available to me. The response from sagemaker_runtime.invoke_endpoint_async is not the response from invoking the actual model endpoint, since the model hasn't been called at that stage. So I cannot reliably extract the response and have to fall back to binary_data=False. For example, the contents of the response are:

b'{"model_name":"ensemble","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false,"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"scores","datatype":"FP32","shape":[1,10],"parameters":{"binary_data_size":40}},{"name":"labels","datatype":"INT64","shape":[1,10],"parameters":{"binary_data_size":80}}]}\x05\xa1v?\xc3\x13\xb6;\x15EX;X!!;\x1eE\x05;\xfa\xbc\x83:\xcbah:.\x9ba:\xd0\xdbI:\xdc\x0c0:w\x01\x00\x00\x00\x00\x00\x00\xb2\x01\x00\x00\x00\x00\x00\x00U\x00\x00\x00\x00\x00\x00\x00E\x02\x00\x00\x00\x00\x00\x00\xc7\x03\x00\x00\x00\x00\x00\x00\x8a\x01\x00\x00\x00\x00\x00\x00}\x00\x00\x00\x00\x00\x00\x00z\x01\x00\x00\x00\x00\x00\x004\x00\x00\x00\x00\x00\x00\x005\x03\x00\x00\x00\x00\x00\x00'
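If I did have the header size, splitting the payload would be straightforward. Here's a minimal sketch using only the standard library; the header and sizes below are synthetic, purely for illustration:

```python
import json

def split_triton_binary_response(body: bytes, json_header_size: int):
    """Split a Triton binary+json response into its JSON header and raw tensor bytes."""
    header = json.loads(body[:json_header_size])
    binary = body[json_header_size:]
    # Each output's "binary_data_size" parameter gives its slice length,
    # in the order the outputs appear in the header.
    tensors, offset = {}, 0
    for out in header["outputs"]:
        size = out["parameters"]["binary_data_size"]
        tensors[out["name"]] = binary[offset:offset + size]
        offset += size
    return header, tensors

# Synthetic example mirroring the response shape above:
demo_header = json.dumps({
    "outputs": [
        {"name": "scores", "datatype": "FP32", "shape": [1, 2],
         "parameters": {"binary_data_size": 8}},
        {"name": "labels", "datatype": "INT64", "shape": [1, 1],
         "parameters": {"binary_data_size": 8}},
    ]
}).encode()
demo_body = demo_header + b"\x00" * 16
hdr, tensors = split_triton_binary_response(demo_body, len(demo_header))
```

(tritonclient.http.InferenceServerClient.parse_response_body also accepts a header_length argument to do the full decode into an InferResult, if I'm reading the client correctly.)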

I need the json-header-size to read the JSON and then the tensors. Is this supported, or do I have to use JSON?

Dave
asked 8 months ago · 355 views
1 Answer

Hi,

Thank you for using AWS SageMaker.

For this question:

  • Firstly, yes: Async Inference will not return the actual result in the invoke response. Asynchronous processing means the model produces the result in the background, potentially over a longer duration, and then pushes the output to the S3 location you have defined.

  • For the Content-Type information, you can get that from the S3 output object's metadata: https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html. The S3 output object will have Content-Type in its metadata, and we think that for this case you can use it to determine the json-header-size.
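As a sketch of that approach: fetch the object's Content-Type with head_object and parse the json-header-size parameter out of it. The head_object call is shown commented out since it needs live credentials; the content-type string below just mirrors the one used in your request:

```python
import re

def json_header_size_from_content_type(content_type: str) -> int:
    """Extract the json-header-size parameter from a Triton binary+json Content-Type."""
    match = re.search(r"json-header-size=(\d+)", content_type)
    if match is None:
        raise ValueError(f"no json-header-size in {content_type!r}")
    return int(match.group(1))

# In practice, the real value comes from the S3 output object's metadata:
#   s3 = boto3.client("s3")
#   content_type = s3.head_object(Bucket=bucket, Key=key)["ContentType"]
size = json_header_size_from_content_type(
    "application/vnd.sagemaker-triton.binary+json;json-header-size=363")
```

With that size in hand, you can split the response body into its JSON header and binary tensor sections.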

To understand the issue in more depth, since I have limited visibility into your setup, I'd recommend reaching out to AWS Support by creating a support case[+] so that an engineer can investigate further and help you resolve the issue.

Reference: [+] Open a support case with AWS using the link: https://console.aws.amazon.com/support/home?#/case/create

AWS
answered 8 months ago
