Failed to create endpoint in SageMaker: JNI and NoClassDefFoundError

I was trying to deploy a model (logged as an MLeap model in Databricks and saved in an S3 bucket) to SageMaker, and got stuck at endpoint creation:

"The primary container for production variant [xxx] did not pass the ping health check. Please check CloudWatch logs for this endpoint."

In the log I found the following block repeating over and over until, some time later, the endpoint creation process stopped and the status turned to 'Failed' in the SageMaker UI:

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/eclipse/jetty/util/thread/ThreadPool
#011at java.lang.Class.getDeclaredMethods0(Native Method)
#011at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
#011at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
#011at java.lang.Class.getMethod0(Class.java:3018)
#011at java.lang.Class.getMethod(Class.java:1784)
#011at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
#011at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.util.thread.ThreadPool
#011at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
#011at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
#011at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
#011at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
#011... 7 more
Got sigterm signal, exiting.
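
For anyone reading a trace like the one above: the `#011` prefix appears to be an escaped tab character (octal 011), and the useful part is the class named after `NoClassDefFoundError` / `ClassNotFoundException`. A minimal sketch for pulling that out of copied log text follows; the helper names `clean_log_line` and `missing_classes` are made up for illustration and are not part of any SageMaker or MLflow API.

```python
import re
from typing import List

# Assumption: "#011" in the log is an escaped tab (octal 011).
TAB_ESCAPE = "#011"

# Captures the class named after either of the two Java errors.
MISSING_CLASS = re.compile(
    r"java\.lang\.(?:NoClassDefFoundError|ClassNotFoundException):\s*([\w./$]+)"
)


def clean_log_line(line: str) -> str:
    """Replace the escaped tab so stack frames are indented normally."""
    return line.replace(TAB_ESCAPE, "\t")


def missing_classes(log_text: str) -> List[str]:
    """Return the distinct missing class names in dotted form, in order seen."""
    found: List[str] = []
    for match in MISSING_CLASS.finditer(log_text):
        # The JVM prints the name with slashes in one error and dots in the other.
        name = match.group(1).replace("/", ".")
        if name not in found:
            found.append(name)
    return found
```

Run on the block above, this reports a single missing class, the Jetty `ThreadPool`, which suggests the Jetty jars are not on the classpath inside the container.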

I wrote my code in Python in Databricks, and as far as I can tell this is a Java error, so I have no idea what went wrong. Does anyone have experience deploying an 'outside' model to SageMaker? Any help or tips would be much appreciated!

FYI: my S3 and ECR setups are both in us-west-2 (Oregon); the integration between the AWS IAM role and the Databricks role is properly set up; in S3, I can see the Databricks distributed filesystem in one bucket and the trained and pickled model in another; the Docker image that is supposed to hold the model is successfully registered in ECR. I also tried changing the instance type under 'Production variants' in the endpoint creation settings to the same instance type (ml.m5.large) that I used for the Databricks runtime cluster, but that did not help.

Update: I successfully trained, logged and deployed a sklearn model, but I still have the same issue with the Spark ML model. For the container, I used an image built by MLflow:

mlflow sagemaker build-and-push-container

Edited by: ShumZZ on Sep 29, 2019 3:12 PM

Edited by: ShumZZ on Oct 2, 2019 2:42 PM

ShumZZ
asked 5 years ago · 340 views
3 Answers

Hi ShumZZ,

I'm assuming your model container is built through MLeap's Databricks runtime integration. From the code base (https://github.com/combust/mleap/tree/master/mleap-databricks-runtime-fat), it seems that the underlying implementation is in Scala, which would require JNI bindings to interact with your Python code.

Have you tried running your model container locally? If the error persists in your local environment, I would suggest reaching out to the MLeap community for assistance. To run your container locally, please follow the commands in the SageMaker documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html
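
Once the container is running locally, the health check that failed during endpoint creation can be reproduced by hand: SageMaker sends a GET request to `/ping` on port 8080 and expects an HTTP 200 response. Below is a minimal sketch of that check. It assumes the container's port 8080 is published on `localhost:8080`; the function name `passes_ping` and the default URL are illustrative only.

```python
import urllib.error
import urllib.request


def passes_ping(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/ping answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/ping", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        # (HTTPError is a subclass of URLError).
        return False


if __name__ == "__main__":
    print("ping ok" if passes_ping() else "ping failed")
```

If this returns False locally as well, the problem is in the container itself rather than in the SageMaker endpoint configuration.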

Thank you very much for trying out Amazon SageMaker! Please let us know if you have additional questions.

Best Regards,
Yijie

answered 5 years ago

Hi Yijie,
Thanks so much for the reply! I followed your advice and saw that the local test indeed failed with the same JNI error...

Yet I am a bit confused, since I actually used MLflow's CLI command mlflow sagemaker build-and-push-container to build the container. I am quite new to all these concepts (this is actually the first time I've worked on an end-to-end ML project), so correct me if I am wrong: should I reach out to the MLflow community instead of MLeap? Or is it the case that, under the hood, the container is indeed built through MLeap's Databricks runtime integration? Any help, clarification or advice would be much appreciated :)

Links that might be useful, but that I do not quite understand:
https://github.com/mlflow/mlflow/blob/master/mlflow/sagemaker/cli.py
https://github.com/mlflow/mlflow/blob/master/mlflow/models/docker_utils.py

ShumZZ
answered 5 years ago

After some discussion with the MLflow community, we confirmed the bug: Java dependencies are not correctly installed in the Docker image that MLflow uses by default. I posted a bug report (https://github.com/mlflow/mlflow/issues/1906) on GitHub, and a temporary fix has been provided by @smurching (https://github.com/mlflow/mlflow/pull/1913).
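
For anyone who wants to check an image for this kind of missing dependency: a jar is a zip archive, so a short script can confirm whether any jar in a directory actually ships the class the JVM could not find. This is only a rough sketch. The function name `jars_containing` is made up, and the directory to scan is an assumption, since where the jars live depends on how the image was built.

```python
import zipfile
from pathlib import Path
from typing import List


def jars_containing(class_name: str, jar_dir: str) -> List[str]:
    """List jars under jar_dir that contain the given fully qualified class."""
    # A class a.b.C is stored inside a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits: List[str] = []
    for jar_path in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar_path) as jar:
                if entry in jar.namelist():
                    hits.append(str(jar_path))
        except zipfile.BadZipFile:
            # Skip files that are not valid archives.
            continue
    return hits
```

For example, `jars_containing("org.eclipse.jetty.util.thread.ThreadPool", "/path/to/jars")` (a placeholder path) returns an empty list when no jar in that directory provides the class, which would be consistent with the error in the question.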

Edited by: ShumZZ on Oct 10, 2019 12:04 PM

ShumZZ
answered 5 years ago
