Serverless Inference - Limit number of workers


We've deployed a Hugging Face model to SageMaker as a serverless endpoint, with memory set to 6 GB and max concurrency set to 1. With these settings, we keep getting errors when we call invoke_endpoint. It doesn't happen every time, but it does about 60% of the time.

When we check the logs and metrics, we see that memory utilization goes up to almost 100%. We also see that, because the machine has 6 CPUs, it starts 6 workers. We believe this could be the cause of the problem. How can we set the number of workers?

Thanks!
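To see why the worker count matters, here is a rough back-of-the-envelope check. It assumes that each model-server worker loads its own copy of the model, plus some shared overhead. The model size and overhead used below are hypothetical numbers, not measurements from this endpoint, and fits_in_memory and max_workers are illustrative helpers only.

```python
def fits_in_memory(workers: int, model_gb: float, overhead_gb: float, limit_gb: float) -> bool:
    """Return True if `workers` model copies plus overhead fit within `limit_gb`.

    Assumption: every worker process holds a full copy of the model.
    """
    total_gb = workers * model_gb + overhead_gb
    return total_gb <= limit_gb


def max_workers(model_gb: float, overhead_gb: float, limit_gb: float) -> int:
    """Largest worker count whose estimated footprint stays within `limit_gb`."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return max(0, int((limit_gb - overhead_gb) // model_gb))


if __name__ == "__main__":
    # Hypothetical: a 1.2 GB model and 0.5 GB shared overhead on a 6 GB endpoint.
    print(fits_in_memory(6, 1.2, 0.5, 6.0))  # 6 * 1.2 + 0.5 = 7.7 GB, so False
    print(max_workers(1.2, 0.5, 6.0))        # (6.0 - 0.5) // 1.2 = 4
```

Under these assumed numbers, six workers would not fit, while one or a few would. That is consistent with the suspicion that one worker per CPU is exhausting memory.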

Asked 2 years ago · 1275 views
2 Answers

From the "sagemaker.pytorch.model.PyTorchModel" documentation:

model_server_workers (int) – Optional. The number of worker processes used by the inference server. If None, server will use one worker per vCPU.
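The documented default can be modeled in a few lines. This is only an illustration of the "one worker per vCPU unless told otherwise" rule quoted above. resolve_worker_count is a hypothetical helper and is not part of the SageMaker SDK or the model server.

```python
import os
from typing import Optional


def resolve_worker_count(model_server_workers: Optional[int], vcpus: Optional[int] = None) -> int:
    """Return the worker count implied by the documented rule.

    An explicit value wins. Otherwise, use one worker per vCPU.
    """
    if model_server_workers is not None:
        if model_server_workers < 1:
            raise ValueError("model_server_workers must be >= 1")
        return model_server_workers
    # os.cpu_count() may return None, so fall back to 1 in that case.
    return vcpus if vcpus is not None else (os.cpu_count() or 1)
```

On a 6-vCPU host, resolve_worker_count(None, vcpus=6) gives 6, which matches the behavior described in the question. resolve_worker_count(1, vcpus=6) gives 1.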

The following example shows how to set the "MODEL_SERVER_WORKERS" environment variable to control the number of workers:

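# SageMaker Python SDK v2 imports; image, role, local_train and
# local_validation are defined earlier in the original example.
from sagemaker.estimator import Estimator
from sagemaker.serializers import CSVSerializer

# Environment variable values must be strings.
env = {
    "MODEL_SERVER_WORKERS": "2"
}

local_regressor = Estimator(
    image,
    role,
    instance_count=1,
    instance_type="local")

train_location = "file://" + local_train
validation_location = "file://" + local_validation
local_regressor.fit({"train": train_location, "validation": validation_location}, logs=True)

# Extra keyword arguments such as env are forwarded to the model created at deploy time.
predictor = local_regressor.deploy(1, "local", serializer=CSVSerializer(), env=env)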

Hope it helps.
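One detail in the snippet above is easy to get wrong: SageMaker's CreateModel API expects environment values to be strings, so passing the integer 2 instead of "2" fails validation. The following small helper converts values in one place. to_container_env is a hypothetical name used for illustration only.

```python
from typing import Dict, Union


def to_container_env(**settings: Union[str, int, float]) -> Dict[str, str]:
    """Build a container environment dict whose values are all strings."""
    return {name: str(value) for name, value in settings.items()}


env = to_container_env(MODEL_SERVER_WORKERS=2)  # {"MODEL_SERVER_WORKERS": "2"}
```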

AWS
Answered 2 years ago
  • Thanks. Adding an "answer" to provide more information below...


Eitan, thanks for replying.

I'm not sure whether this worked, because the CloudWatch logs no longer show the number of workers! The performance seems to be the same, however. It's failing more often than it's responding, and memory is still reaching almost 100%.

Instead of your code, I used the following, as I'm deploying a Hugging Face model:

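import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# model_name, tar_bucket_name and tarfile_name are defined elsewhere.
huggingface_model = HuggingFaceModel(
    name=model_name,
    # Build the S3 URI directly; os.path.join is meant for local paths, not URLs.
    model_data=f"s3://{tar_bucket_name}/{tarfile_name}",
    env={
        'HF_TASK': 'text-classification',
        'MODEL_SERVER_WORKERS': '1',
        'MODEL_SERVER_TIMEOUT': '300'
    },
    role=sagemaker.get_execution_role(),
    entry_point='inference.py',
    transformers_version='4.12.3',
    pytorch_version='1.9.1',
    py_version='py38'
)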

Two follow-up questions, then, if you don't mind:

  1. How can I see if the serverless function actually created only one worker per instance?
  2. Where can I find all the different environment variables accepted by SageMaker?
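One way to answer both questions from inside the container is to have inference.py report on itself. This sketch assumes that model_fn runs once in each worker process, which is how model servers such as MMS typically load models, and that stdout ends up in CloudWatch. The marker text, the prefix list, and the helper names are all hypothetical choices, not anything SageMaker logs on its own.

```python
import os
import re
from typing import Dict, Iterable, Tuple

# Hypothetical marker printed by our own model_fn; neither SageMaker nor the
# model server emits this line by itself.
MARKER = "model_fn loaded model in pid="
_PID_RE = re.compile(re.escape(MARKER) + r"(\d+)")

# Assumed prefixes of settings worth inspecting; adjust for your container.
ENV_PREFIXES: Tuple[str, ...] = ("SAGEMAKER_", "MMS_", "HF_", "MODEL_SERVER_")


def log_worker_start() -> str:
    """Call from model_fn: print this worker's PID so it shows up in the logs."""
    line = f"{MARKER}{os.getpid()}"
    print(line, flush=True)
    return line


def model_server_env(prefixes: Tuple[str, ...] = ENV_PREFIXES) -> Dict[str, str]:
    """Return the matching environment variables the container actually received."""
    return {k: v for k, v in os.environ.items() if k.startswith(prefixes)}


def count_workers(log_lines: Iterable[str]) -> int:
    """Count distinct worker PIDs among log lines exported from CloudWatch."""
    pids = set()
    for line in log_lines:
        match = _PID_RE.search(line)
        if match:
            pids.add(int(match.group(1)))
    return len(pids)
```

For example, you could call log_worker_start() and print(model_server_env()) at the top of model_fn. Each worker would then log its PID, along with whichever worker-count variable actually reached the container. Running count_workers over the exported log lines for one container gives the number of workers that loaded the model.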

Many thanks!

Rogerio

Answered 2 years ago
  • Hi! I created the model (using CDK) with the environment variable SAGEMAKER_MODEL_SERVER_WORKERS. Maybe that makes the difference?
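Following up on the comment above: the environment in the HuggingFaceModel snippet could set the worker count under both names mentioned in this thread, to be safe. Setting both is a defensive assumption, since the thread does not settle which name the container reads. The SageMaker Python SDK's model_server_workers argument on framework models also appears to set SAGEMAKER_MODEL_SERVER_WORKERS for you, so verify this against your SDK version. hf_env is a hypothetical helper.

```python
from typing import Dict


def hf_env(task: str, workers: int, timeout_s: int) -> Dict[str, str]:
    """Environment for the HuggingFaceModel in the question (string values only).

    The worker count is set under both candidate variable names because it is
    not confirmed here which one the inference container reads.
    """
    return {
        "HF_TASK": task,
        "MODEL_SERVER_WORKERS": str(workers),
        "SAGEMAKER_MODEL_SERVER_WORKERS": str(workers),
        "MODEL_SERVER_TIMEOUT": str(timeout_s),
    }


# Pass as HuggingFaceModel(..., env=env, ...) in the snippet from the question.
env = hf_env("text-classification", workers=1, timeout_s=300)
```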
