Folks:
I am running some code that uses a mix of PySpark (for data manipulation) and Python (for visualization), very similar to this blog post:
https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
The cluster I am using has all of the defaults:
Release label: emr-5.36.0
Hadoop distribution: Amazon 2.10.1
Applications: Spark 2.4.8, Livy 0.7.1, Hive 2.3.9, JupyterEnterpriseGateway 2.1.0
The command: sc.install_pypi_package("pandas")
seems to work successfully.
However, the command: sc.install_pypi_package("matplotlib")
fails with an error on the Pillow dependency. The specific error is:
Building wheels for collected packages: unknown, unknown
Running setup.py bdist_wheel for unknown: started
Running setup.py bdist_wheel for unknown: finished with status 'error'
Complete output from command /tmp/1669916958616-0/bin/python -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-e28wksxd/pillow/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmprh77de26pip-wheel- --python-tag cp37:
running bdist_wheel
running build
running build_ext
The headers or library files could not be found for jpeg, a required dependency when compiling Pillow from source.
I logged into the Master node on the EMR cluster and attempted to install some of the libraries and Python compiler support using:
sudo yum install python3-devel redhat-rpm-config libtiff-devel libjpeg-devel openjpeg2-devel zlib-devel \
freetype-devel lcms2-devel libwebp-devel tcl-devel tk-devel \
harfbuzz-devel fribidi-devel libraqm-devel libimagequant-devel libxcb-devel
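A quick way to confirm that the headers are actually visible on a node after the yum install is a small script along these lines. This is only a sketch: the function name find_header and the two include directories are my own assumptions, not something EMR or Pillow defines. The header names are the usual ones shipped by the -devel packages (jpeglib.h for libjpeg, zlib.h for zlib, ft2build.h for FreeType).

```python
import os

# Assumed header search locations; adjust for the actual node image.
DEFAULT_INCLUDE_DIRS = ("/usr/include", "/usr/local/include")


def find_header(header_name, include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return every path under include_dirs whose file name equals header_name."""
    matches = []
    for root_dir in include_dirs:
        # Skip directories that do not exist on this node.
        if not os.path.isdir(root_dir):
            continue
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            if header_name in filenames:
                matches.append(os.path.join(dirpath, header_name))
    return sorted(matches)


if __name__ == "__main__":
    # Headers that a Pillow source build commonly looks for.
    for header in ("jpeglib.h", "zlib.h", "ft2build.h"):
        found = find_header(header)
        print(header, "->", found if found else "NOT FOUND")
```

Running this on the Master node only reports on that node. If the build runs on another node, the same check would need to be run there as well.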
While the installation of pillow gets a bit further, there are still errors such as:
Collecting pillow
Using cached https://files.pythonhosted.org/packages/16/11/da8d395299ca166aa56d9436e26fe8440e5443471de16ccd9a1d06f5993a/Pillow-9.3.0.tar.gz
Building wheels for collected packages: unknown, unknown
Running setup.py bdist_wheel for unknown: started
Running setup.py bdist_wheel for unknown: finished with status 'done'
Stored in directory: /var/lib/livy/.cache/pip/wheels/55/5a/ad/9f708fd6d1500e9ff680e17b1c2f436e8439477a5a226611c6
Running setup.py bdist_wheel for unknown: started
Running setup.py bdist_wheel for unknown: finished with status 'error'
Complete output from command /tmp/1669918064446-0/bin/python -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-jnpkka_0/unknown/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpn9k3ctzopip-wheel- --python-tag cp37:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/1669918064446-0/lib64/python3.7/tokenize.py", line 447, in open
buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/tmp/pip-build-jnpkka_0/unknown/setup.py'
I feel as if I am missing something very obvious. It cannot possibly be this difficult to get a commonly used package like matplotlib to work.
Any suggestions?
Thanks
Rich H.