As you're seeing in the error message, SageMaker Serverless Inference imposes a limit of 10GiB (10737418240 bytes) on your deployed container size - which helps deliver quality of service for considerations like cold-start time. From a quick look I didn't see this mentioned in the SageMaker serverless docs, but as mentioned in the launch blog post, SageMaker Serverless is backed by AWS Lambda and the AWS Lambda quotas page lists the limit.
So to solve the issue (and still use SageMaker Serverless Inference), you'll need to optimize the container image size by removing unnecessary bloat - going by the number you posted, you need to find almost 2GiB of savings.
Some suggestions on that:
- Are you currently building your actual model into the image itself? The typical pattern on SageMaker is to host a `model.tar.gz` tarball on S3, which gets downloaded and extracted into your container at runtime. For large language models and similar, this can be a big size saving (although of course, optimizing the overall S3 + image size can still help give you the best start-up times). The contents of this file are flexible, so you could offload multiple artifacts into it.
- I saw you're using the standard PyTorch DLC as a base... Are you replacing the entire serving stack, or slotting your custom logic into the one the DLC provides? The stack already provided in the PyTorch container supports (see the docs here) customizing model loading via `model_fn`, input de-serialization via `input_fn`, output serialization via `output_fn`, and the actual prediction via `predict_fn`. The APIs between these user-defined functions are very flexible (for example, `model_fn` can return pretty much whatever you like, so long as `predict_fn` knows how to use it), so in practice I find it can support even complex requirements like custom request formats, pipelining multiple models together, advanced pre-processing, etc. I've seen some customers go straight to building custom serving stacks (installing their dependencies alongside the existing one, e.g. TorchServe, in the image) before realising that the pre-built stack could already support what they needed. Again, this `inference.py` script would live in your `model.tar.gz` - see the sketch after this list.
- General, non-SageMaker-specific container image optimization guidelines still apply: for example, you might see the AWS DLCs clearing apt caches in the same `RUN` command as performing the apt installs. If you find yourself really struggling with the size of the base AWS DLC, you could look into building from scratch or from another base image and installing everything you need... but of course you'd need to do the due diligence to check you're including everything required and that it's well optimized.
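For illustration, here's a minimal sketch of such an `inference.py`, assuming a TorchScript model saved as `model.pth` inside your `model.tar.gz` - the artifact name and the JSON request format are just placeholders; only the four handler function names are fixed by the serving stack:

```python
# inference.py - lives inside model.tar.gz, which (for recent PyTorch DLCs) is
# typically laid out like:
#   model.tar.gz
#   |-- model.pth             <- example artifact name; yours may differ
#   `-- code/
#       |-- inference.py
#       `-- requirements.txt  <- optional extra pip dependencies
import json
import os

import torch


def model_fn(model_dir):
    """Load the model from the directory where model.tar.gz was extracted."""
    model = torch.jit.load(os.path.join(model_dir, "model.pth"), map_location="cpu")
    model.eval()
    return model


def input_fn(request_body, content_type):
    """De-serialize the request into whatever predict_fn expects."""
    if content_type == "application/json":
        payload = json.loads(request_body)
        return torch.tensor(payload["inputs"])
    raise ValueError(f"Unsupported content type: {content_type}")


def predict_fn(data, model):
    """Run the prediction, receiving the outputs of input_fn and model_fn."""
    with torch.no_grad():
        return model(data)


def output_fn(prediction, accept):
    """Serialize the prediction back to the client."""
    if accept == "application/json":
        return json.dumps({"outputs": prediction.tolist()})
    raise ValueError(f"Unsupported accept type: {accept}")
```

The `code/` subfolder convention (with an optional `requirements.txt` for extra pip packages) is how the newer PyTorch DLC serving stack picks up your script, so none of this needs to be baked into the image itself.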
You need a smaller container image. Also, take into consideration that at the moment SageMaker serverless endpoints do not support GPU acceleration (see https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html#serverless-endpoints-how-it-works-exclusions).
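In case it's useful, here's a rough sketch of deploying an S3-hosted `model.tar.gz` to a serverless endpoint with the SageMaker Python SDK - the bucket, role ARN, and framework/Python versions are placeholders to swap for your own (and note you'd pick a CPU framework build, given the lack of GPU support):

```python
# Sketch only: deploy an S3-hosted model.tar.gz to a SageMaker serverless endpoint.
# The S3 URI, IAM role ARN, and framework/Python versions below are placeholders.
from sagemaker.pytorch import PyTorchModel
from sagemaker.serverless import ServerlessInferenceConfig

model = PyTorchModel(
    model_data="s3://your-bucket/path/model.tar.gz",  # includes code/inference.py
    role="arn:aws:iam::111122223333:role/YourSageMakerRole",
    entry_point="inference.py",
    framework_version="1.12",  # match (a CPU build of) the DLC you're targeting
    py_version="py38",
)

predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=6144,  # 1024-6144 MB, in 1 GB increments
        max_concurrency=5,
    ),
)
```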
Thanks for your reply. Yes, what I want to do is just write our own `model_fn` and `predict_fn` functions. You said these functions should be in an `inference.py` inside `model.tar.gz`, but I only found the document https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html, which says these functions should be written into the Docker container. Is there any documentation about the structure of the `model.tar.gz` file, and which file in `model.tar.gz` will be run? Thank you.