Glue + SageMaker Pip Packages


My customer is looking to use Glue dev endpoints along with a SageMaker notebook. What I've noticed is that Glue lists a package, in this case scipy, as version 1.4.1, but the version you get in a SageMaker notebook may or may not match, depending on the kernel.

conda_python3:

!pip show scipy
Name: scipy
Version: 1.1.0
Summary: SciPy: Scientific Library for Python
Home-page: https://www.scipy.org
Author: None
Author-email: None
License: BSD
Location: /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages
Requires: 
Required-by: seaborn, scikit-learn, sagemaker

conda_tensorflow_p36:

!pip show scipy
Name: scipy
Version: 1.4.1
Summary: SciPy: Scientific Library for Python
Home-page: https://www.scipy.org
Author: None
Author-email: None
License: BSD
Location: /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages
Requires: numpy
Required-by: seaborn, scikit-learn, sagemaker, Keras

Is there some sort of best practice to use a kernel that corresponds directly to what's installed on Glue?

Separate, less important question: I wasn't able to activate the venv that Jupyter notebooks use via a shell. Is it using a venv? Why can't I find the right activate script?

asked 3 years ago
1 Answer
Accepted Answer

conda_python3 and conda_tensorflow_p36 are local kernels that run on the SageMaker notebook instance, while the Spark kernels execute remotely in the Glue Spark environment.

That is why you are seeing different versions. The Glue Spark environment ships with scipy 1.4.1, so when you use the PySpark (Python) or Spark (Scala) kernels you will get scipy 1.4.1.
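A quick way to confirm which environment a kernel runs in is to check the package version from inside that kernel: a local conda kernel reports the notebook instance's packages, while the PySpark kernel reports the Glue endpoint's. The helper below is an illustrative sketch (not an AWS API) for comparing a locally installed version against the version you expect on the Glue side:

```python
# Sketch: check whether the kernel's local package version matches the version
# the Glue Dev endpoint is documented to provide (1.4.1 for scipy here).
from importlib.metadata import version, PackageNotFoundError

def matches_glue_version(package: str, glue_version: str) -> bool:
    """Return True if the locally installed package version equals glue_version."""
    try:
        return version(package) == glue_version
    except PackageNotFoundError:
        # Package is not installed in this kernel's environment at all.
        return False

# Run in conda_python3 this would compare against 1.1.0's environment;
# run via the PySpark kernel it executes on the Glue endpoint instead.
print(matches_glue_version("scipy", "1.4.1"))
```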

If you use the default lifecycle configuration script that Glue SageMaker notebooks come with, the connectivity to the Glue Dev endpoint should already be in place. Note that Glue SageMaker notebooks have a tag called 'aws-glue-dev-endpoint' that identifies which Glue Dev endpoint a particular notebook instance communicates with.
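If you want to check that association programmatically, the tags on the notebook instance can be listed (for example with SageMaker's ListTags API) and searched for that key. The sketch below works on a tag list in the standard Key/Value shape; the sample data is illustrative, not real output:

```python
# Sketch: given a tag list in the Key/Value shape that SageMaker's ListTags
# API returns, pull out the 'aws-glue-dev-endpoint' tag to see which Glue
# Dev endpoint the notebook instance talks to.
def glue_endpoint_from_tags(tags):
    """Return the Glue Dev endpoint name from a tag list, or None if untagged."""
    for tag in tags:
        if tag.get("Key") == "aws-glue-dev-endpoint":
            return tag.get("Value")
    return None

# Illustrative sample of what the tag list might look like.
sample_tags = [
    {"Key": "Team", "Value": "analytics"},
    {"Key": "aws-glue-dev-endpoint", "Value": "my-dev-endpoint"},
]
print(glue_endpoint_from_tags(sample_tags))  # my-dev-endpoint
```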

The Spark kernels cannot be replicated from a plain Python shell. Those kernels relay Spark commands via the Livy service to Spark on the Glue Dev endpoint, using a Jupyter module called Sparkmagic. That is also why there is no local venv to activate for them: the code never runs on the notebook instance.
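Under the hood, Sparkmagic submits each cell to Livy's REST API as a statement in an open session. A minimal sketch of the kind of JSON body posted to Livy's `POST /sessions/{id}/statements` endpoint (field name per the Livy REST API; no actual network call here):

```python
# Sketch: build the JSON body Sparkmagic-style clients POST to Livy at
# /sessions/{sessionId}/statements to execute code remotely on Spark.
import json

def livy_statement_body(code: str) -> str:
    """Serialize a code snippet into a Livy statement-submission payload."""
    return json.dumps({"code": code})

payload = livy_statement_body("import scipy; print(scipy.__version__)")
print(payload)
```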

Ref: https://github.com/jupyter-incubator/sparkmagic

answered 3 years ago
