EMR Studio PySpark kernel uses a lower version of pip


I am using a Jupyter Notebook provided by EMR Studio, an AWS managed service. My understanding is that these notebooks are hosted on the EC2 instances that I provision as part of my EMR cluster, and specifically that the PySpark kernel runs on the task nodes.

Currently, when I run sc.list_packages() in the notebook, I see pip at version 9.0.1, whereas if I SSH onto the main node and run pip list, I see pip at version 20.2.2. The older pip version in the notebook causes problems when I run sc.install_pypi_package().

In a notebook cell, if I run import pip and then evaluate pip, I see that the module is located at

<module 'pip' from '/mnt1/yarn/usercache/<LIVY_IMPERSONATION_ROLE>/appcache/application_1652110228490_0001/container_1652110228490_0001_01_000001/tmp/1652113783466-0/lib/python3.7/site-packages/pip/__init__.py'> 

I am assuming this is most likely a virtualenv of some sort, running as part of an application on the task node. I am unsure of this, and I have no concrete evidence of how the virtualenv is provisioned, if there is one.

If I run sc.uninstall_package('pip') and then sc.list_packages(), I see pip at version 20.2.2, which is the version I want to start with. The module path is the same as the one shown above.

How can I get pip 20.2.2 in the virtualenv instead of pip 9.0.1?

If I import a package like numpy, I see that the module is in a different location from pip. Is there a reason for this?

<module 'numpy' from '/usr/local/lib64/python3.7/site-packages/numpy/__init__.py'>

As for pip 9.0.1, the only reference I can find at the moment is /lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl. One directory above this there is a file called virtualenv-15.1.0-py2.7.egg-info which, when I cat it, states that it upgrades to pip 9.0.1. There is also a virtualenv.py file that contains __version__ = "15.1.0". I tried replacing the pip 9.0.1 wheel file with a pip 20.2.2 wheel, but that prevented the PySpark kernel from provisioning properly.

Lastly, this AWS blog post has a screenshot, below the console output for sc.list_packages(), that shows pip at version 19.2.3, but I am not sure how that was achieved: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/

asked 3 years ago, 254 views
1 Answer

The Python environment that the EMR notebook uses is "/emr/notebook-env/bin/python", which is different from the default "/usr/bin/python". This is why you see different pip versions. You may notice the same kind of difference between pip list and !pip list when both are run from an EMR notebook, and the explanation is the same.
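
One way to see which environment a given cell is actually using is to print the interpreter and pip locations from inside it. The following is a minimal sketch using only the standard library; the helper name describe_environment is hypothetical, and the output will depend on how your cluster is set up.

```python
import sys


def describe_environment():
    """Return basic facts about the interpreter running this code."""
    # virtualenv releases before 20 set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    base_prefix = getattr(sys, "real_prefix", getattr(sys, "base_prefix", sys.prefix))
    info = {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "base_prefix": base_prefix,
        "in_virtualenv": base_prefix != sys.prefix,
    }
    try:
        import pip
        info["pip_version"] = pip.__version__
        info["pip_location"] = pip.__file__
    except ImportError:
        # pip is not importable from this interpreter
        info["pip_version"] = None
        info["pip_location"] = None
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running this in a notebook cell and again in a Python session over SSH on the main node should make it clear whether the two are using the same interpreter and the same pip.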

So, as next steps:

  • You can install the Python dependency manually from the EMR notebook if needed.

  • If you want to automate the installation on the EMR cluster that you use with the EMR notebook, consider using the script below as a bootstrap action [1], so that the dependency gets installed in both Python environments:

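#!/bin/bash
# Install into the default system Python environment
sudo pip3 install <dependency>
# Install into the Python environment used by the EMR notebook
sudo /emr/notebook-env/bin/pip install <dependency>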

The catch is that you need to use a delayed bootstrap action script, so that the installation runs only after the EMR cluster reaches the WAITING state; see https://repost.aws/knowledge-center/emr-update-all-nodes-bootstrap . The delay is needed because, at the time a regular bootstrap action runs, the /emr/notebook-env path does not exist yet, so the bootstrap action fails, and a failed bootstrap action terminates the cluster.

You might already be aware that by default, the Bootstrap action runs before the application provisioning phase of the EMR cluster.
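
Once the cluster is in the WAITING state, you may want to confirm what each environment ended up with. The sketch below asks each interpreter for its pip version through a subprocess. It is only an illustration: the helper name pip_version is hypothetical, /usr/bin/python3 is an assumption (adjust it to your default interpreter), and /emr/notebook-env/bin/python is the path mentioned above.

```python
import os
import subprocess

# The first path is an assumption; the second is the notebook environment
# path mentioned in this answer.
INTERPRETERS = ["/usr/bin/python3", "/emr/notebook-env/bin/python"]


def pip_version(interpreter):
    """Return the `pip --version` line for an interpreter, or None if unavailable."""
    if not os.path.exists(interpreter):
        return None
    result = subprocess.run(
        [interpreter, "-m", "pip", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        # The interpreter exists but pip could not be run from it
        return None
    return result.stdout.strip()


for path in INTERPRETERS:
    print(path, "->", pip_version(path))
```

This runs on whichever node executes it, so run it on a cluster node (for example over SSH on the main node) rather than relying on where a notebook cell happens to execute.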

References: [1]: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html

AWS
answered 7 months ago
AWS
SUPPORT ENGINEER
reviewed 5 months ago
