I've built a Triton container and I'd like to deploy it as an Async Endpoint that's invoked nightly. I have it working OK with AutoScaling and I can invoke it fine using application/json. It's a lot slower than using binary_data, though. I can create the request as follows:
import numpy as np
import tritonclient.http

text = tritonclient.http.InferInput('text', [len(test_data)], "BYTES")
text.set_data_from_numpy(np.array(test_data, dtype=object).reshape(text.shape()), binary_data=True)
labels = tritonclient.http.InferRequestedOutput('labels', binary_data=True)
scores = tritonclient.http.InferRequestedOutput('scores', binary_data=True)

# Need to create the body and send it via the SageMaker client
# rather than using tritonclient directly
request_body, header_length = tritonclient.http.InferenceServerClient.generate_request_body(
    inputs=[text], outputs=[labels, scores]
)
with open("examples/request.bin", "wb") as f:
    f.write(request_body)
I can copy this to S3, invoke the endpoint, and get the response back no problem:
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation="s3://data-science.cimenviro.com/models/triton-serve/input/request.bin",
    ContentType=f'application/vnd.sagemaker-triton.binary+json;json-header-size={header_length}')
output_location = response['OutputLocation']
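For reference, this is roughly how I read the result back from S3 (a sketch; split_s3_uri and fetch_async_output are my own helpers, not SageMaker APIs):

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    # "s3://bucket/key/parts" -> ("bucket", "key/parts")
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")

def fetch_async_output(output_location):
    import boto3  # deferred so split_s3_uri stays importable without AWS
    bucket, key = split_s3_uri(output_location)
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    # Note: only whatever ContentType S3 recorded is available here --
    # the Triton json-header-size parameter from the model's response is not.
    return obj["Body"].read()
```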
The issue is that in order to parse the response I need the json-header-size from the response's ContentType, but because SageMaker invokes the endpoint on my behalf it's not available. The response from sagemaker_runtime.invoke_endpoint_async is not the response from invoking the actual model endpoint, since the model hasn't been called at that stage. So I cannot reliably split the response and have to fall back to binary_data=False. The contents of the output object look like this:
b'{"model_name":"ensemble","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false,"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"scores","datatype":"FP32","shape":[1,10],"parameters":{"binary_data_size":40}},{"name":"labels","datatype":"INT64","shape":[1,10],"parameters":{"binary_data_size":80}}]}\x05\xa1v?\xc3\x13\xb6;\x15EX;X!!;\x1eE\x05;\xfa\xbc\x83:\xcbah:.\x9ba:\xd0\xdbI:\xdc\x0c0:w\x01\x00\x00\x00\x00\x00\x00\xb2\x01\x00\x00\x00\x00\x00\x00U\x00\x00\x00\x00\x00\x00\x00E\x02\x00\x00\x00\x00\x00\x00\xc7\x03\x00\x00\x00\x00\x00\x00\x8a\x01\x00\x00\x00\x00\x00\x00}\x00\x00\x00\x00\x00\x00\x00z\x01\x00\x00\x00\x00\x00\x004\x00\x00\x00\x00\x00\x00\x005\x03\x00\x00\x00\x00\x00\x00'
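The only workaround I've come up with is to recover the header length myself, on the assumption that the header is the first complete JSON object in the body. This is a sketch with my own function names, not anything from tritonclient:

```python
import json
import numpy as np

# Only the dtypes my model emits; Triton defines more.
TRITON_TO_NUMPY = {"FP32": np.float32, "INT64": np.int64}

def split_triton_response(body: bytes):
    # latin-1 maps every byte to exactly one character, so the index
    # returned by raw_decode equals the byte offset of the header's end.
    header, end = json.JSONDecoder().raw_decode(body.decode("latin-1"))
    return header, body[end:]

def parse_binary_outputs(header, tail):
    out, offset = {}, 0
    for o in header["outputs"]:
        dtype = TRITON_TO_NUMPY[o["datatype"]]
        count = int(np.prod(o["shape"]))
        arr = np.frombuffer(tail, dtype=dtype, count=count, offset=offset)
        out[o["name"]] = arr.reshape(o["shape"])
        offset += o["parameters"]["binary_data_size"]
    return out
```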
I need the json-header-size to read the JSON header and then the tensors. Is this supported, or do I have to fall back to JSON?