EMR bootstrap script with pip numpy installation fails on r6+ instances

I recently tested moving from r5 to r6 instance fleets for our PySpark script. It depends on numpy and pandas, which are installed via pip in a bootstrap script, along with a few other dependencies for communicating with S3:

#!/bin/bash -xe

echo "---------------------------------------------------------"

echo "using python version:"
python3 --version
echo "initial python packages (sudo python3 -m pip list):"
sudo python3 -m pip list

echo "---------------------------------------------------------"

echo "install python3-dev development tools"
sudo yum -y install python3-devel

echo "---------------------------------------------------------"

echo "installing python dependencies"
sudo python3 -m pip install -U pip
echo "pip installed/updated"
sudo python3 -m pip install -U setuptools
echo "setuptools installed"
sudo python3 -m pip install \
    cloudpickle==1.6.0 \
    boto3==1.21.7 \
    fsspec==2022.2.0 \
    s3fs==0.4.2
echo "aws dependencies installed (boto3, cloudpickle, fsspec, s3fs)"
# sudo python3 -m pip install \
#     pandas==1.1.5 \
#     numpy==1.16.5
# echo "pandas + numpy installed"
sudo python3 -m pip install pandas==1.2.5
echo "pandas installed"

echo "final python packages (sudo python3 -m pip list):"
sudo python3 -m pip list

This runs without failure on r5 instances, and numpy is available in the Python environment as expected.

When allowing {r6, r6g} instance types, the bootstrap script fails with the following message:

      _configtest.c:1:10: fatal error: Python.h: No such file or directory
       #include <Python.h>
                ^~~~~~~~~~
      compilation terminated.
      failure.
      removing: _configtest.c _configtest.o
      Traceback (most recent call last):
        File "<string>", line 36, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/setup.py", line 419, in <module>
          setup_package()
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/setup.py", line 411, in setup_package
          setup(**metadata)
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/core.py", line 171, in setup
          return old_setup(**new_attr)
        File "/usr/local/lib/python3.7/site-packages/setuptools/__init__.py", line 155, in setup
          return distutils.core.setup(**attrs)
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
          return run_commands(dist)
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
          dist.run_commands()
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
          self.run_command(cmd)
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
          cmd_obj.run()
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/install.py", line 62, in run
          r = self.setuptools_run()
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/install.py", line 36, in setuptools_run
          return distutils_install.run(self)
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/command/install.py", line 670, in run
          self.run_command('build')
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
          cmd_obj.run()
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build.py", line 47, in run
          old_build.run(self)
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
          cmd_obj.run()
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build_src.py", line 148, in run
          self.build_sources()
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build_src.py", line 165, in build_sources
          self.build_extension_sources(ext)
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build_src.py", line 322, in build_extension_sources
          sources = self.generate_sources(sources, ext)
        File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build_src.py", line 375, in generate_sources
          source = func(extension, build_dir)
        File "numpy/core/setup.py", line 423, in generate_config_h
          moredefs, ignored = cocache.check_types(config_cmd, ext, build_dir)
        File "numpy/core/setup.py", line 47, in check_types
          out = check_types(*a, **kw)
        File "numpy/core/setup.py", line 281, in check_types
          "install {0}-dev|{0}-devel.".format(python))
      SystemError: Cannot compile 'Python.h'. Perhaps you need to install python-dev|python-devel.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> numpy

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Note that this bootstrap script already attempts to address the problem named in the error message by installing python3-devel via yum before running the pip install.
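
To narrow this down, it may help to log the CPU architecture and check whether Python.h is visible to the same python3 that pip builds against. The following is only a diagnostic sketch, not part of the original bootstrap script; the helper names arch_label and python_include_dir are made up for illustration.

```shell
#!/bin/bash
# Diagnostic sketch (assumption: run on the node where the bootstrap fails).

# Map `uname -m` output to a short label. Hypothetical helper.
arch_label() {
    case "$1" in
        aarch64|arm64) echo "arm64" ;;
        x86_64|amd64)  echo "x86_64" ;;
        *)             echo "unknown" ;;
    esac
}

# Print the directory where python3 expects to find Python.h.
python_include_dir() {
    python3 -c 'import sysconfig; print(sysconfig.get_paths()["include"])'
}

echo "architecture: $(arch_label "$(uname -m)")"

if command -v python3 >/dev/null 2>&1; then
    inc="$(python_include_dir)"
    if [ -f "${inc}/Python.h" ]; then
        echo "Python.h found in ${inc}"
    else
        echo "Python.h missing from ${inc}"
    fi
else
    echo "python3 not found on PATH"
fi
```

If the header is reported missing even after the yum step, the python3 that pip uses may not be the one python3-devel was installed for.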

asked 2 years ago, 1052 views
1 Answer

I suspect you need an arm64 version of numpy, since r6g instances use Arm-based Graviton processors. You will probably have to build it as a wheel file and then add it to the EMR cluster.

AWS
Alex_T
answered 2 years ago
  • Ahhhhhhhhhh, good thinking, that makes sense. That solution is reasonable, but it also means that I would have to fully commit my instance fleet to the ARM architecture if the wheels come in via my Spark --py-files parameter.
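
One possible way around committing the whole fleet to one architecture, sketched here under assumptions: bootstrap actions run on every node, so the script itself could pick a wheel directory based on `uname -m` instead of shipping wheels through --py-files. The WHEEL_ROOT location and the wheel_dir_for_arch helper are hypothetical, and the sketch assumes wheels for each architecture were built beforehand and copied to the node. It only prints the pip command it would run.

```shell
#!/bin/bash
# Sketch: choose prebuilt wheels per CPU architecture inside the bootstrap script.
# WHEEL_ROOT is a hypothetical local directory holding one subdirectory per architecture.
WHEEL_ROOT="${WHEEL_ROOT:-/mnt/wheels}"

# Echo the wheel directory for a given `uname -m` value; fail for anything else.
wheel_dir_for_arch() {
    case "$1" in
        aarch64|arm64) echo "${WHEEL_ROOT}/aarch64" ;;
        x86_64|amd64)  echo "${WHEEL_ROOT}/x86_64" ;;
        *)             return 1 ;;
    esac
}

if dir="$(wheel_dir_for_arch "$(uname -m)")"; then
    # --no-index and --find-links make pip install only from the local wheel directory.
    echo "would run: sudo python3 -m pip install --no-index --find-links ${dir} numpy pandas==1.2.5"
else
    echo "no prebuilt wheels for $(uname -m)" >&2
fi
```

With this layout, r5 and r6g nodes in the same fleet would each resolve their own directory, at the cost of maintaining two sets of wheels.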
