Endpoint did not pass the ping health check but in CloudWatch /ping status is 200

I have fine-tuned Llama 2 70b-chat-hf model artifacts stored on S3 as a tarball archive.

When I deploy the model on SageMaker, the endpoint moves to the Failed state with the following message:

The primary container for production variant <> did not pass the ping health check. Please check CloudWatch logs for this endpoint.

But in CloudWatch I can see that the app is up and running, and there are a bunch of successful /ping endpoint responses:

[Screenshot: CloudWatch log entries showing successful /ping responses]

When I deploy the base llama2-70b-chat-hf model, there are no issues.

Can you advise how to resolve the issue?

Igor
asked 5 months ago, 218 views
1 Answer
Hello. The following steps can help you troubleshoot the endpoint health check failure even though the /ping endpoint returns a 200 status:

Analyze the CloudWatch Logs:

Go through the CloudWatch logs for the endpoint in detail, looking for any failures, warnings, or unusual activity that might be interfering with the health check. Pay special attention to the entries logged around the time the health check fails. Verify that there are no conflicts within the container and no resource exhaustion or dependency errors.
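If the log stream is long, scanning it with a script can be easier than reading it in the console. Below is a minimal sketch, not a definitive tool. It assumes boto3 is installed with working AWS credentials and that the endpoint logs to the usual /aws/sagemaker/Endpoints/<endpoint-name> log group. The keyword list and the function names `find_suspicious` and `fetch_messages` are hypothetical choices made for illustration; adjust them to whatever your serving container actually prints.

```python
import re
from typing import Iterable, Iterator, List

# Hypothetical keyword list; adjust to what the serving container actually logs.
SUSPICIOUS = re.compile(
    r"error|exception|traceback|out of memory|killed|timeout",
    re.IGNORECASE,
)


def find_suspicious(messages: Iterable[str]) -> List[str]:
    """Return the log messages that contain any of the suspicious keywords."""
    return [m for m in messages if SUSPICIOUS.search(m)]


def fetch_messages(endpoint_name: str) -> Iterator[str]:
    """Yield log messages for a SageMaker endpoint.

    Needs boto3 and AWS credentials. The log group name follows the
    /aws/sagemaker/Endpoints/<endpoint-name> convention.
    """
    import boto3  # imported lazily so find_suspicious works without boto3

    logs = boto3.client("logs")
    paginator = logs.get_paginator("filter_log_events")
    log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
    for page in paginator.paginate(logGroupName=log_group):
        for event in page["events"]:
            yield event["message"]
```

For example, `find_suspicious(fetch_messages("my-endpoint"))` would list the matching lines, where "my-endpoint" is a placeholder for your endpoint name. An empty result only means none of the chosen keywords appeared, not that the container is healthy.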

Check the Model Artifacts:

Make sure the fine-tuned model artifacts are packaged correctly in the S3 tarball archive. Verify that all files and dependencies the model needs are present. If any files are missing or corrupted, the model may not load correctly, which could result in failed health checks.
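One way to do this check is to download the tarball and list what is inside it. Here is a minimal sketch using only the Python standard library. The function name `missing_from_tarball` is hypothetical, and the file names in the usage note are only examples of what a Hugging Face style model directory often contains, not a list confirmed for your container. Comparing against the contents of the base model that deploys successfully is probably the most reliable reference.

```python
import tarfile
from typing import Iterable, List


def missing_from_tarball(tar_path: str, required: Iterable[str]) -> List[str]:
    """Return the required file names that are not present in the archive.

    Only the base name of each regular file is compared, so a file counts
    as present no matter which directory inside the archive holds it.
    """
    with tarfile.open(tar_path, "r:*") as tar:  # "r:*" auto-detects compression
        present = {
            member.name.split("/")[-1]
            for member in tar.getmembers()
            if member.isfile()
        }
    return [name for name in required if name not in present]
```

For example, `missing_from_tarball("model.tar.gz", ["config.json", "tokenizer_config.json"])` returns the names from that list that are absent. Because it matches base names only, it will not tell you if the files sit in an unexpected subdirectory, so it is worth also printing `tar.getnames()` and comparing the layout with the working base model.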

answered 5 months ago
  • Unfortunately, there is nothing suspicious: no errors, tracebacks, or warnings. The application started, followed by a lot of successful /ping responses. After the process hit the container_startup_health_check_timeout_in_seconds limit, it was terminated.
