As you're seeing in the error message, SageMaker Serverless Inference imposes a limit of 10 GiB (10737418240 bytes) on your deployed container image size, which helps preserve quality of service for things like cold-start time. From a quick look I didn't see this mentioned in the SageMaker serverless docs, but as mentioned in the launch blog post, SageMaker Serverless is backed by AWS Lambda, and the AWS Lambda quotas page lists the limit.
So to solve the issue (and still use SageMaker Serverless Inference), you'll need to optimize your container image size by removing unnecessary bloat: from the number you posted (12704675783 bytes), you need to shave off almost 2 GiB.
Some suggestions on that:
- Are you currently building your actual model into the image itself? The typical pattern on SageMaker is to host a `model.tar.gz` tarball on S3, which gets downloaded and extracted into your container at runtime. For large language models and similar, this can be a big size saving (although of course, optimizing overall S3 + image size can still help give you the best start-up times). The contents of this file are flexible, so you could offload multiple artifacts.
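As a minimal sketch of that pattern (the `package_model` helper here is hypothetical, not part of any SDK), you can bundle weights and other artifacts with the standard library and upload the result to S3 instead of baking it into the image:

```python
import os
import tarfile


def package_model(artifact_paths, output_path="model.tar.gz"):
    """Bundle model artifacts into a model.tar.gz for SageMaker.

    SageMaker downloads this archive from S3 and extracts it into
    /opt/ml/model when the container starts, so large files can live
    on S3 rather than inside the container image.
    """
    with tarfile.open(output_path, "w:gz") as tar:
        for path in artifact_paths:
            # arcname keeps entries relative so they extract cleanly
            tar.add(path, arcname=os.path.basename(path))
    return output_path
```

You'd then upload the tarball to S3 and point your model at it (e.g. via the `model_data` parameter when constructing a `sagemaker.pytorch.PyTorchModel`), rather than `COPY`ing the weights into the Dockerfile.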
- I saw you're using the standard PyTorch DLC as a base... Are you replacing the entire serving stack, or slotting your custom logic into the one the DLC provides? The stack already provided in the PyTorch container supports (see docs here) customizing model loading via `model_fn`, input deserialization via `input_fn`, output serialization via `output_fn`, and the actual prediction via `predict_fn`. The APIs between these user-defined functions are very flexible (for example, you can return pretty much whatever you like from `model_fn`, so long as `predict_fn` knows how to use it), so I find in practice that it can support even complex requirements like custom request formats, pipelining multiple models together, advanced pre-processing, etc. I've seen some customers go straight to building custom serving stacks (installing their dependencies alongside the existing TorchServe in the image, for example) before realising that the pre-built stack could already support what they needed. Again, this `inference.py` script would live in your `model.tar.gz` rather than in the image itself.
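To illustrate, a minimal `inference.py` sketch might look like the following. The handler names (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) are the hooks the PyTorch serving stack looks for; the JSON format and `model.pt` filename are assumptions for the example:

```python
# inference.py -- hypothetical handler overrides for the pre-built
# PyTorch serving stack; adapt the payload format to your own API.
import json
import os


def model_fn(model_dir):
    """Load the model from /opt/ml/model (the extracted model.tar.gz).

    torch is imported lazily so the module's other functions stay
    framework-agnostic; in the real container it's already installed.
    """
    import torch

    return torch.jit.load(os.path.join(model_dir, "model.pt"))


def input_fn(request_body, content_type):
    """Deserialize the request payload into model inputs."""
    if content_type == "application/json":
        return json.loads(request_body)["inputs"]
    raise ValueError(f"Unsupported content type: {content_type}")


def predict_fn(inputs, model):
    """Run inference; `model` is whatever model_fn returned."""
    import torch

    with torch.no_grad():
        return model(torch.as_tensor(inputs)).tolist()


def output_fn(prediction, accept):
    """Serialize predictions back to the client."""
    return json.dumps({"predictions": prediction})
```

Because each function's inputs and outputs are under your control, you can layer in pre-/post-processing or chain multiple models inside `predict_fn` without replacing the serving stack.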
- General (non-SageMaker-specific) container image optimization guidelines still apply: for example, you might see the AWS DLCs clearing apt caches in the same `RUN` command that performs the apt installs. If you find yourself really struggling with the size of the AWS DLC base, you could look into building from scratch or from another base image and installing everything you need... but of course, you'd need to do the due diligence to check you're including everything required and that it's well optimized.
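As a hypothetical sketch of that pattern (the `libgomp1` package is just a placeholder), combining installation and cleanup in one `RUN` keeps the apt cache out of any committed layer:

```dockerfile
# Install and clean up in a single layer so cached package lists
# never inflate the final image.
RUN apt-get update \
 && apt-get install -y --no-install-recommends libgomp1 \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*
```

The same idea applies to pip (`pip install --no-cache-dir ...`) and to deleting any build-time artifacts within the layer that created them.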
You need a smaller container image. Also note that, at the moment, SageMaker serverless endpoints do not support GPU acceleration (see https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html#serverless-endpoints-how-it-works-exclusions).