YOLOv5 model deployment to a SageMaker endpoint


I have a trained YOLOv5 model that I deployed to a real-time endpoint in SageMaker. I tried almost all of the available GPU instance types, but I could not get the inference time below 2.6 seconds. This is a light model, and my target is an inference time under 1 second. Could you please help me with any hints? I converted the model from PyTorch to TensorFlow format. Thank you!

  • Probably over-simple, but worth checking: what's your input request format? Are you sending compressed images (e.g. 'image/jpeg') and decoding them in the endpoint, or sending raw pixel tensors (which could add significant communication overhead)?
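To put a rough number on the overhead mentioned in the comment above, here is a minimal sketch that compares the request size of a raw pixel tensor sent as JSON with the same pixels sent as plain bytes. The 640x640x3 input shape and the synthetic pixel values are assumptions for illustration only; the actual model input may differ.

```python
import json


def payload_sizes(height, width, channels=3):
    """Return request-body sizes in bytes for one synthetic uint8 image."""
    # Synthetic pixel values in 0..255; real image content would differ.
    image = [
        [[(y + x + c) % 256 for c in range(channels)] for x in range(width)]
        for y in range(height)
    ]
    # Raw pixel tensor serialized as JSON, in the "instances" layout that
    # the TensorFlow Serving REST API accepts.
    json_body = json.dumps({"instances": [image]}).encode("utf-8")
    # The same pixels as one byte per value, with no text overhead.
    raw_body = bytes(v for row in image for px in row for v in px)
    return {"json_tensor": len(json_body), "raw_uint8": len(raw_body)}


sizes = payload_sizes(640, 640)  # assumed input shape
print(sizes)
```

The JSON form is several times larger than the raw bytes, and a JPEG of a typical photo at this resolution is usually much smaller than the raw bytes again, so sending 'image/jpeg' and decoding in the endpoint is likely the cheapest option on the wire.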

1 Answer

Have you read the following two blog posts?

  1. Maximize TensorFlow performance on Amazon SageMaker endpoints for real-time inference
  2. Reduce computer vision inference latency using gRPC with TensorFlow Serving on Amazon SageMaker

The first explores a few parameters that you can use to maximize the performance of a TensorFlow-based SageMaker real-time endpoint. In essence, these parameters overprovision serving processes and adjust their parallel processing capabilities. According to the benchmark tables in that post, this overprovisioning and adjustment leads to better utilization of resources and higher throughput, sometimes an increase of as much as 1,000%.
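As a rough sketch of what setting such parameters can look like, the helper below builds the environment-variable dictionary for the SageMaker TensorFlow Serving container. The variable names are the ones I recall from that container's documentation, so please verify them against the current docs. The function name `build_tfs_env` and the numeric values are hypothetical and are not recommendations from the blog post.

```python
def build_tfs_env(tfs_instances, gunicorn_workers, inter_op_threads, intra_op_threads):
    """Build container environment variables as strings (hypothetical helper)."""
    values = {
        # Number of TensorFlow Serving processes started in the container.
        "SAGEMAKER_TFS_INSTANCE_COUNT": tfs_instances,
        # Number of web-server worker processes in front of them.
        "SAGEMAKER_GUNICORN_WORKERS": gunicorn_workers,
        # Threads used to run independent operations in parallel.
        "SAGEMAKER_TFS_INTER_OP_PARALLELISM": inter_op_threads,
        # Threads used inside a single operation.
        "SAGEMAKER_TFS_INTRA_OP_PARALLELISM": intra_op_threads,
    }
    for name, value in values.items():
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    # Environment variables are always passed as strings.
    return {name: str(value) for name, value in values.items()}


# Illustrative values only; tune them by benchmarking on the chosen instance type.
env = build_tfs_env(tfs_instances=2, gunicorn_workers=4,
                    inter_op_threads=1, intra_op_threads=1)
print(env)
```

With the SageMaker Python SDK, a dictionary like this would typically be passed through the `env` argument when creating the model object, before calling `deploy`.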

The second demonstrates how to reduce model serving latency for TensorFlow computer vision models on SageMaker via in-server gRPC communication, with a 75% latency improvement in the example shown.

answered 3 months ago
  • Thanks @MarkRoy for the help! I tried the suggestions, but nothing worked, and I still cannot get the inference time below 2.6 seconds. Any other suggestions? Thank you!
