Serverless Inference - Limit number of workers


We've deployed a Hugging Face model to SageMaker as a serverless endpoint, with memory set to 6 GB and max concurrency set to 1. With these settings, we keep getting errors when we call invoke_endpoint — not all the time, but about 60% of the time...

When we check the logs and metrics, we see that memory usage climbs to almost 100%. We also see that, since the machine has 6 CPUs, it starts 6 workers. We believe this could be the cause of the problem. How can we set the number of workers?

Thanks!

Asked 2 years ago · 1274 views

2 Answers

From the “sagemaker.pytorch.model.PyTorchModel” documentation:

model_server_workers (int) – Optional. The number of worker processes used by the inference server. If None, server will use one worker per vCPU.
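Based on that parameter, the worker count can be set directly in the model constructor. A minimal sketch, assuming the SageMaker Python SDK; the S3 path, IAM role, and entry point below are placeholders, and the import is guarded so the kwargs remain visible even without the SDK installed:

```python
# Sketch: limit the inference server to a single worker process via the
# model_server_workers constructor argument (instead of one per vCPU).
# All S3/IAM values below are placeholders, not real resources.
model_kwargs = {
    "model_data": "s3://my-bucket/model.tar.gz",       # placeholder
    "role": "arn:aws:iam::123456789012:role/my-role",  # placeholder
    "entry_point": "inference.py",
    "framework_version": "1.9.1",
    "py_version": "py38",
    "model_server_workers": 1,  # one worker process in total
}

try:
    from sagemaker.pytorch.model import PyTorchModel
    model = PyTorchModel(**model_kwargs)
except Exception:
    # SageMaker SDK not installed (or no AWS session available);
    # the kwargs above still show the intended call shape.
    model = None
```

The environment-variable route shown next achieves the same effect when you cannot pass the constructor argument.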

You can see in this example how to set the “MODEL_SERVER_WORKERS” environment variable to control the number of workers.

env = {
    "MODEL_SERVER_WORKERS": "2"  # number of model server worker processes
}

local_regressor = Estimator(
    image,
    role,
    instance_count=1,
    instance_type="local")

train_location = 'file://' + local_train
validation_location = 'file://' + local_validation
local_regressor.fit({'train': train_location, 'validation': validation_location}, logs=True)

predictor = local_regressor.deploy(1, 'local', serializer=csv_serializer, env=env)

Hope it helps.

AWS
Answered 2 years ago
  • Thanks. Adding an "answer" to provide more information below...


Eitan, thanks for replying.

I'm not sure whether this worked, as the CloudWatch logs are no longer showing the number of workers! The performance seems to be the same, however: it's failing more often than it's responding, and memory still reaches almost 100%.

Instead of your code, I used the following, as I'm deploying a Hugging Face model:

import os

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    name=model_name,
    model_data=os.path.join("s3://" + tar_bucket_name, tarfile_name),
    env={
        'HF_TASK': 'text-classification',
        'MODEL_SERVER_WORKERS': '1',
        'MODEL_SERVER_TIMEOUT': '300'
    },
    role=sagemaker.get_execution_role(),
    entry_point='inference.py',
    transformers_version='4.12.3',
    pytorch_version='1.9.1',
    py_version='py38'
)
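For reference, the 6 GB / max-concurrency-1 serverless settings described in the question map onto the SDK's ServerlessInferenceConfig. A minimal sketch, with the import guarded so the configuration values stay visible without the SDK:

```python
# Sketch of the serverless endpoint settings from the question:
# 6 GB of memory and a maximum concurrency of 1.
serverless_kwargs = {
    "memory_size_in_mb": 6144,  # 6 GB, as configured in the question
    "max_concurrency": 1,
}

try:
    from sagemaker.serverless import ServerlessInferenceConfig
    config = ServerlessInferenceConfig(**serverless_kwargs)
    # The model above would then be deployed with something like:
    # predictor = huggingface_model.deploy(serverless_inference_config=config)
except Exception:
    # SageMaker SDK not available in this environment.
    config = None
```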

Two follow up questions then, if you don't mind:

  1. How can I see if the serverless function actually created only one worker per instance?
  2. Where can I find all the different environment variables accepted by SageMaker?

Many thanks!

Rogerio

Answered 2 years ago
  • Hi! I created the model (using CDK) with the environment variable SAGEMAKER_MODEL_SERVER_WORKERS. Maybe that makes the difference?
