EMR Studio PySpark kernel uses an older version of pip

I am using a Jupyter notebook provided by EMR Studio, an AWS managed service. My understanding is that these notebooks run on the EC2 instances I provision as part of my EMR cluster, and that the PySpark kernel specifically runs on the task nodes.

When I run sc.list_packages() in the notebook, pip is at version 9.0.1, whereas if I SSH into the main node and run pip list, pip is at version 20.2.2. The older pip version in the notebook causes problems when I run sc.install_pypi_package().

In a notebook cell, if I run import pip and then evaluate pip, I see that the module is located at:

<module 'pip' from '/mnt1/yarn/usercache/<LIVY_IMPERSONATION_ROLE>/appcache/application_1652110228490_0001/container_1652110228490_0001_01_000001/tmp/1652113783466-0/lib/python3.7/site-packages/pip/__init__.py'> 

I assume this is a virtualenv of some sort, running inside the YARN application container on a task node. I am not sure of this, though, and I have no concrete evidence of how the virtualenv is provisioned, if there is one.
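One way to gather evidence for the virtualenv assumption is to inspect the interpreter from a notebook cell. The sketch below uses only the standard library; describe_environment and module_location are hypothetical helper names, and the check relies on the usual convention that legacy virtualenv sets sys.real_prefix while venv makes sys.base_prefix differ from sys.prefix.

```python
import importlib
import sys


def describe_environment():
    """Collect basic facts about the interpreter running this cell."""
    # Legacy virtualenv releases set sys.real_prefix; venv changes sys.base_prefix.
    real_prefix = getattr(sys, "real_prefix", None)
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "base_prefix": base_prefix,
        "real_prefix": real_prefix,
        "in_virtualenv": real_prefix is not None or base_prefix != sys.prefix,
    }


def module_location(name):
    """Return (file, version) for an importable module, or None if it is missing."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__file__", None), getattr(module, "__version__", None)


for key, value in describe_environment().items():
    print(key, "=", value)

# Compare where pip and numpy are loaded from.
for name in ("pip", "numpy"):
    print(name, "->", module_location(name))
```

If in_virtualenv is true and prefix points inside the YARN container directory while numpy resolves under /usr/local, that would suggest the environment is layered on top of the system site-packages, though this is only an inference from the paths.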

If I run sc.uninstall_package('pip') and then sc.list_packages(), pip shows version 20.2.2, which is the version I want to start with. The module path is the same as the one above.

How can I get pip 20.2.2 in the virtualenv instead of pip 9.0.1?

If I import a package like numpy, its module is in a different location from pip's. Is there a reason for this?

<module 'numpy' from '/usr/local/lib64/python3.7/site-packages/numpy/__init__.py'>

As for pip 9.0.1, the only reference I can find so far is /lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl. One directory up, there is a file called virtualenv-15.1.0-py2.7.egg-info which, when I cat it, states that it upgrades to pip 9.0.1. I tried replacing the pip 9.0.1 wheel with a pip 20.2.2 wheel, but that prevented the PySpark kernel from provisioning properly. There is also a virtualenv.py file that contains __version__ = "15.1.0".
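To see which wheels are bundled in that directory, a short sketch like the following could be run on the node itself (for example over SSH). It assumes the standard wheel file naming (name-version-python-abi-platform.whl) and uses the virtualenv_support path quoted above; bundled_wheels is a hypothetical helper name, and my understanding that legacy virtualenv seeds pip from this directory is an assumption, not something confirmed for EMR.

```python
from pathlib import Path


def bundled_wheels(support_dir):
    """Map each distribution name to the versions of its wheels in support_dir."""
    directory = Path(support_dir)
    if not directory.is_dir():
        return {}
    wheels = {}
    for path in sorted(directory.glob("*.whl")):
        # Wheel file names look like name-version-python-abi-platform.whl
        parts = path.name[: -len(".whl")].split("-")
        if len(parts) >= 5:
            wheels.setdefault(parts[0], []).append(parts[1])
    return wheels


# Path taken from the question; adjust if the directory differs on your node.
print(bundled_wheels("/lib/python2.7/site-packages/virtualenv_support"))
```

This only reads file names, so it is safe to run without touching the wheels that the kernel depends on.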

Lastly, this AWS blog post includes a screenshot, below the console output for sc.list_packages(), that shows pip at version 19.2.3, but I am not sure how that was achieved: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/

rk01
Asked 2 years ago · 134 views
No answers
