Access to model container logs for SageMaker Async Endpoint

0

I'm using the NVIDIA Triton deep learning container. When I configure a standard (real-time) endpoint it works fine: the CloudWatch log group /aws/sagemaker/Endpoints/[EndpointName] contains the container logs (i.e. messages written to the console from the inference script).

But with async inference, all I get is a single [production-variant-name]/[instance-id]/data-log stream containing the information from the async queue, e.g.

2024-04-22T01:59:25.220:[sagemaker logs] [9d5880e2-74fc-431a-b659-c126454b5cc5] Inference request succeeded. ModelLatency: 2267959 us, RequestDownloadLatency: 433665 us, ResponseUploadLatency: 148004 us, TimeInBacklog: 680581 ms, TotalProcessingTime: 683482 ms

This makes it really hard to diagnose issues - how do I access the actual logs from the container when running in async mode?
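In case it helps, this is roughly how the endpoint is set up and invoked (simplified sketch; the bucket, model, endpoint and instance-type values below are placeholders, not my real config):

```python
import boto3

sm = boto3.client("sagemaker")
smr = boto3.client("sagemaker-runtime")

# Async endpoint config: same model as the working real-time endpoint,
# plus the AsyncInferenceConfig block.
sm.create_endpoint_config(
    EndpointConfigName="triton-async-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "triton-model",          # same model used for the standard endpoint
        "InstanceType": "ml.g4dn.xlarge",
        "InitialInstanceCount": 1,
    }],
    AsyncInferenceConfig={
        "OutputConfig": {"S3OutputPath": "s3://my-bucket/async-output/"},
    },
)
sm.create_endpoint(EndpointName="triton-async",
                   EndpointConfigName="triton-async-config")

# Async requests are submitted by S3 location rather than as an inline payload.
smr.invoke_endpoint_async(
    EndpointName="triton-async",
    InputLocation="s3://my-bucket/async-input/request.bin",
    ContentType="application/octet-stream",
)
```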

asked a year ago · 460 views
3 Answers
1

Hello,

Thank you for using Amazon SageMaker.

At the moment, the [production-variant-name]/[instance-id]/data-log stream is the only log that Amazon SageMaker provides for asynchronous endpoints.

I have raised a feature request on your behalf to include the model container logs for async endpoints. While I am unable to comment on if/when this feature may be released, I would encourage you to keep an eye on our What's New and Blog pages for any new feature announcements.

AWS
SUPPORT ENGINEER
answered a year ago
1

I have active asynchronous inference endpoints for which both [production-variant-name]/[instance-id] (the endpoint logs) and [production-variant-name]/[instance-id]/data-log (the queue orchestration logs) are present, so I believe the other answer cannot be correct... However, my endpoints aren't using the Triton container.

I'd suggest double-checking for permissions errors that could be preventing your endpoint from creating the relevant CloudWatch log streams, exploring any other factors that might be hiding logs (e.g. the configured log level, or other Triton settings), and maybe setting the PYTHONUNBUFFERED environment variable and adding some additional print() statements, if possible, to be sure.
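A rough sketch of those two checks (the model name, role ARN, image URI and endpoint name are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")
logs = boto3.client("logs")

# 1) Set PYTHONUNBUFFERED on the model so print() output is flushed
#    immediately and has a chance to reach CloudWatch.
sm.create_model(
    ModelName="triton-model-unbuffered",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    PrimaryContainer={
        "Image": "<triton-container-image-uri>",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
        "Environment": {"PYTHONUNBUFFERED": "TRUE"},
    },
)

# 2) List the log streams that actually exist for the endpoint, to see whether
#    only the */data-log stream is being created or the per-instance streams too.
response = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/Endpoints/triton-async"
)
for stream in response["logStreams"]:
    print(stream["logStreamName"])
```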

AWS
EXPERT
answered a year ago
0

Thanks @Marta_M, @Alex_T, I followed your suggestions. But I'm using the exact same model invoked as a batch transform, and I get the [production-variant-name]/[instance-id]/data-log when calling create_transform_job but not when calling invoke_endpoint_async, and I don't see anywhere obvious where one mode could set the logging level differently from the other.

One additional friction point I've encountered is that SageMaker in general doesn't appear to log anything written to stdout/stderr until the endpoint is in service, which means any messages produced by, say, an entrypoint.sh script that starts the server in the container aren't captured anywhere that I've seen.
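For reference, these are the two invocation paths I'm comparing against the same model (names and S3 paths below are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")
smr = boto3.client("sagemaker-runtime")

# Batch transform path: container logs appear under /aws/sagemaker/TransformJobs.
sm.create_transform_job(
    TransformJobName="triton-batch",
    ModelName="triton-model",
    TransformInput={
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/batch-input/"}},
        "ContentType": "application/octet-stream",
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/batch-output/"},
    TransformResources={"InstanceType": "ml.g4dn.xlarge", "InstanceCount": 1},
)

# Async endpoint path: requests go via S3, and in my case only the */data-log
# stream shows up under /aws/sagemaker/Endpoints/[EndpointName].
smr.invoke_endpoint_async(
    EndpointName="triton-async",
    InputLocation="s3://my-bucket/async-input/request.bin",
    ContentType="application/octet-stream",
)
```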

answered 10 months ago
