Installing Python modules in EMR Cluster using EMR Notebook

Folks:

I am running code that mixes PySpark (for data manipulation) and Python (for visualization), very similar to this blog post: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/. The cluster I am using has all of the defaults:

Release label: emr-5.36.0
Hadoop distribution: Amazon 2.10.1
Applications: Spark 2.4.8, Livy 0.7.1, Hive 2.3.9, JupyterEnterpriseGateway 2.1.0

The command sc.install_pypi_package("pandas") completes successfully.

However, the command sc.install_pypi_package("matplotlib") fails while building the Pillow dependency. The specific error is:

Building wheels for collected packages: unknown, unknown
  Running setup.py bdist_wheel for unknown: started
  Running setup.py bdist_wheel for unknown: finished with status 'error'
  Complete output from command /tmp/1669916958616-0/bin/python -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-e28wksxd/pillow/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmprh77de26pip-wheel- --python-tag cp37:
  running bdist_wheel
  running build
  running build_ext
  
    The headers or library files could not be found for jpeg,  a required dependency when compiling Pillow from source.
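
Because the failure above comes from compiling Pillow from source, one thing that may be worth trying is pinning the packages to specific releases and installing Pillow before matplotlib. The sketch below is only an illustration: the helper names (pinned_specs, install_pinned) are hypothetical, and the version numbers are untested placeholders, not releases known to work on emr-5.36.0. It assumes sc.install_pypi_package accepts a "name==version" string, as shown in the linked blog post.

```python
def pinned_specs(pins):
    """Turn a {"package": "version"} mapping into pip requirement strings.

    Insertion order is preserved, so dependencies can be listed first.
    """
    return [f"{name}=={version}" for name, version in pins.items()]


def install_pinned(spark_context, pins):
    """Call install_pypi_package once per pinned requirement, in order."""
    for spec in pinned_specs(pins):
        spark_context.install_pypi_package(spec)


# Placeholder versions (assumption): replace with releases you have verified.
PINS = {"pillow": "8.4.0", "matplotlib": "3.4.3"}

# In an EMR notebook, where `sc` is already defined:
# install_pinned(sc, PINS)
```

Whether a given pin avoids the source build depends on the pip version in the notebook environment and on which wheels that release publishes, so it is worth checking the output of each install.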

I logged in to the master node of the EMR cluster and tried to install the missing libraries and Python build support using:

sudo yum install python3-devel redhat-rpm-config libtiff-devel libjpeg-devel openjpeg2-devel zlib-devel \
    freetype-devel lcms2-devel libwebp-devel tcl-devel tk-devel \
    harfbuzz-devel fribidi-devel libraqm-devel libimagequant-devel libxcb-devel

The Pillow installation now gets a bit further, but it still fails with errors such as:

Collecting pillow
  Using cached https://files.pythonhosted.org/packages/16/11/da8d395299ca166aa56d9436e26fe8440e5443471de16ccd9a1d06f5993a/Pillow-9.3.0.tar.gz
Building wheels for collected packages: unknown, unknown
  Running setup.py bdist_wheel for unknown: started
  Running setup.py bdist_wheel for unknown: finished with status 'done'
  Stored in directory: /var/lib/livy/.cache/pip/wheels/55/5a/ad/9f708fd6d1500e9ff680e17b1c2f436e8439477a5a226611c6
  Running setup.py bdist_wheel for unknown: started
  Running setup.py bdist_wheel for unknown: finished with status 'error'
  Complete output from command /tmp/1669918064446-0/bin/python -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-jnpkka_0/unknown/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpn9k3ctzopip-wheel- --python-tag cp37:
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/1669918064446-0/lib64/python3.7/tokenize.py", line 447, in open
      buffer = _builtin_open(filename, 'rb')
  FileNotFoundError: [Errno 2] No such file or directory: '/mnt/tmp/pip-build-jnpkka_0/unknown/setup.py'
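
The two logs fail for different reasons (a missing jpeg library in the first, a missing setup.py in the second), and the useful line is easy to lose in the pip output. As a rough aid for comparing runs, the sketch below pulls those two kinds of clue out of a log with regular expressions. The function name summarize_pip_failure and the patterns are my own illustration, matched only against the messages quoted in this post; other pip versions may word them differently.

```python
import re

# Patterns based on the two error messages quoted in this post.
MISSING_LIB = re.compile(r"could not be found for (\w+)")
MISSING_FILE = re.compile(r"No such file or directory: '([^']+)'")


def summarize_pip_failure(log_text):
    """Return the library names and file paths that pip reported as missing."""
    return {
        "missing_libraries": MISSING_LIB.findall(log_text),
        "missing_files": MISSING_FILE.findall(log_text),
    }


# Sample lines copied from the output above.
sample_log = (
    "The headers or library files could not be found for jpeg,  "
    "a required dependency when compiling Pillow from source.\n"
    "FileNotFoundError: [Errno 2] No such file or directory: "
    "'/mnt/tmp/pip-build-jnpkka_0/unknown/setup.py'\n"
)

print(summarize_pip_failure(sample_log))
```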

I feel as if I am missing something very obvious. It cannot possibly be this difficult to get a commonly used package like matplotlib to work.

Any suggestions?

Thanks

Rich H.

Asked a year ago, 162 views