Endpoint did not pass the ping health check, but CloudWatch shows /ping returning 200


I have fine-tuned LLaMa2 70b-chat-hf artifacts stored on S3 as a tarball archive.

When I deploy the model on SageMaker, the endpoint moves to a failed state with the following message:

The primary container for production variant <> did not pass the ping health check. Please check CloudWatch logs for this endpoint.

However, in CloudWatch I can see that the app is up and running, and there are many successful /ping responses.

When I deploy the base llama2-70b-chat-hf model, there are no issues.

Can you advise how to resolve the issue?

Igor
Asked 4 months ago, 213 views
1 Answer

Hello. The following steps can help you troubleshoot the endpoint health check failure even though the /ping endpoint returns a 200 status:

Analyze the CloudWatch logs:

Go through the CloudWatch logs for the endpoint in detail, looking for failures, warnings, or unusual activity that might interfere with the health check. Pay special attention to the entries logged around the time the health check fails. Verify that there are no conflicts within the container and no resource exhaustion or dependency errors.
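A rough sketch of how that log review could be scripted, in case it helps. It assumes the endpoint writes to the usual /aws/sagemaker/Endpoints/<endpoint-name> log group and that boto3 is installed with credentials configured. The keyword list and both function names are made up for illustration, so adjust them to whatever your serving container actually logs.

```python
from typing import Iterable, Iterator

# Hypothetical keyword list; not an official set of failure markers.
SUSPICIOUS = ("error", "traceback", "warning", "killed", "out of memory")


def suspicious_messages(messages: Iterable[str], keywords=SUSPICIOUS) -> list[str]:
    """Return the log messages that contain any keyword (case-insensitive)."""
    lowered = tuple(k.lower() for k in keywords)
    return [m for m in messages if any(k in m.lower() for k in lowered)]


def endpoint_log_messages(endpoint_name: str) -> Iterator[str]:
    """Yield every log message of a SageMaker endpoint from CloudWatch Logs."""
    import boto3  # imported lazily so the filter above works without AWS access

    logs = boto3.client("logs")
    paginator = logs.get_paginator("filter_log_events")
    # Assumed log group naming for SageMaker endpoints.
    pages = paginator.paginate(
        logGroupName=f"/aws/sagemaker/Endpoints/{endpoint_name}"
    )
    for page in pages:
        for event in page["events"]:
            yield event["message"]
```

Calling `suspicious_messages(endpoint_log_messages("my-endpoint"))` (with a placeholder endpoint name) would then list only the lines worth a closer look. An empty result does not prove the container is healthy; it only means none of the chosen keywords appeared.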

Check the Model Artifacts:

In the S3 tarball archive, make sure the fine-tuned model artifacts are packaged correctly. Verify that all files and dependencies the model needs are present. If any files are missing or corrupted, the model may not load correctly, which could result in failed health checks.
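One way to run this check locally is to download the tarball and list what sits at its root. The sketch below is only an illustration: the required file names are assumptions based on a typical Hugging Face style checkpoint, and the expectation that they live at the archive root (rather than in a subfolder) is also an assumption that depends on the serving container.

```python
import tarfile
from pathlib import PurePosixPath

# Assumed file names; replace with what your model and container expect.
REQUIRED = ("config.json", "tokenizer_config.json")


def missing_files(tarball_path: str, required=REQUIRED) -> list[str]:
    """Return the required file names that are absent from the archive root."""
    with tarfile.open(tarball_path, "r:*") as archive:
        # Normalise "./config.json" and "config.json" to the same name.
        root_names = {
            str(PurePosixPath(member.name))
            for member in archive.getmembers()
            if member.isfile()
        }
    return [name for name in required if name not in root_names]
```

A file stored as `model/config.json` is reported as missing here, which is intentional: an extra top-level folder in the tarball is an easy packaging difference to overlook when comparing against a base model that deploys fine.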

Answered 4 months ago
  • Unfortunately, nothing suspicious: no errors, tracebacks, or warnings. The application started, followed by many successful /ping responses. After the process hit the container_startup_health_check_timeout_in_seconds limit, it was terminated.
