Installing Python modules in EMR Cluster using EMR Notebook

Folks:

I am running some code that uses a mix of PySpark (for data manipulation) and Python (for visualization), very similar to this blog post: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/. The cluster I am using has all of the default settings:

Release label: emr-5.36.0
Hadoop distribution: Amazon 2.10.1
Applications: Spark 2.4.8, Livy 0.7.1, Hive 2.3.9, JupyterEnterpriseGateway 2.1.0

The command sc.install_pypi_package("pandas") appears to complete successfully.

However, the command sc.install_pypi_package("matplotlib") fails with an error on the Pillow dependency. The specific error is:

Building wheels for collected packages: unknown, unknown
  Running setup.py bdist_wheel for unknown: started
  Running setup.py bdist_wheel for unknown: finished with status 'error'
  Complete output from command /tmp/1669916958616-0/bin/python -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-e28wksxd/pillow/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmprh77de26pip-wheel- --python-tag cp37:
  running bdist_wheel
  running build
  running build_ext
  
    The headers or library files could not be found for jpeg,  a required dependency when compiling Pillow from source.
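
For reference, the message above refers to the libjpeg development headers. A minimal sketch of how one could check whether they are visible on a node, assuming the headers live under the usual /usr/include or /usr/local/include directories (the directory list and the find_header helper are illustrative assumptions, not part of EMR or Pillow):

```python
import os

# Assumed search locations; a given AMI may keep headers elsewhere.
DEFAULT_INCLUDE_DIRS = ("/usr/include", "/usr/local/include")


def find_header(header_name, include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return the first path where header_name exists, or None if absent."""
    for directory in include_dirs:
        candidate = os.path.join(directory, header_name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # jpeglib.h ships with libjpeg-devel, zlib.h with zlib-devel.
    for header in ("jpeglib.h", "zlib.h"):
        print(header, "->", find_header(header) or "not found")
```

A "not found" result for jpeglib.h would be consistent with the Pillow build error, although the build may also search other directories.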

I logged in to the master node of the EMR cluster and tried to install the development libraries and Python build support using:

sudo yum install python3-devel redhat-rpm-config libtiff-devel libjpeg-devel openjpeg2-devel zlib-devel \
    freetype-devel lcms2-devel libwebp-devel tcl-devel tk-devel \
    harfbuzz-devel fribidi-devel libraqm-devel libimagequant-devel libxcb-devel

The Pillow installation now gets a bit further, but there are still errors such as:

Collecting pillow
  Using cached https://files.pythonhosted.org/packages/16/11/da8d395299ca166aa56d9436e26fe8440e5443471de16ccd9a1d06f5993a/Pillow-9.3.0.tar.gz
Building wheels for collected packages: unknown, unknown
  Running setup.py bdist_wheel for unknown: started
  Running setup.py bdist_wheel for unknown: finished with status 'done'
  Stored in directory: /var/lib/livy/.cache/pip/wheels/55/5a/ad/9f708fd6d1500e9ff680e17b1c2f436e8439477a5a226611c6
  Running setup.py bdist_wheel for unknown: started
  Running setup.py bdist_wheel for unknown: finished with status 'error'
  Complete output from command /tmp/1669918064446-0/bin/python -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-jnpkka_0/unknown/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpn9k3ctzopip-wheel- --python-tag cp37:
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/1669918064446-0/lib64/python3.7/tokenize.py", line 447, in open
      buffer = _builtin_open(filename, 'rb')
  FileNotFoundError: [Errno 2] No such file or directory: '/mnt/tmp/pip-build-jnpkka_0/unknown/setup.py'

I feel as if I am missing something very obvious. It cannot possibly be this difficult to get a commonly used package like matplotlib to work.

Any suggestions?

Thanks

Rich H.

Asked a year ago · 184 views
No answers
