EMR 7.13.0: Bootstrap pip3 install targets Python 3.9 site-packages while Spark/YARN workloads run Python 3.11, causing ModuleNotFoundError for PyPI packages (e.g. boto3)

Service and environment

  • Product: Amazon EMR
  • Release label: emr-7.13.0
  • Applications: Spark, Livy (and others as applicable)
  • Region: [e.g. us-east-1]
  • Workload: PySpark / Livy submitting Python driver code that imports packages installed in a bootstrap action

Summary

On EMR 7.13, our bootstrap script installs Python dependencies with sudo pip3 install ..., which places the packages under /usr/local/lib/python3.9/site-packages. At runtime, the Python process used for the submitted job (the driver and the Spark executors) appears to be Python 3.11, so those packages are not on sys.path and imports fail (e.g. ModuleNotFoundError: No module named 'boto3').

On EMR 7.12, the same bootstrap pattern works without this mismatch, which suggests a regression or an undocumented change in default Python selection between 7.12 and 7.13.

Expected behavior

  • The Python interpreter used by PySpark / Livy for user code should be the same as the one targeted by the documented / default pip3 on cluster nodes, or
  • EMR documentation and bootstrap examples should clearly state which python / pip to use on 7.13 so customer bootstrap actions install into the correct site-packages.

Actual behavior

  • pip3 installs into Python 3.9 paths.
  • Jobs run under Python 3.11 (observed behavior on worker nodes).
  • Third-party modules installed only for 3.9 are not importable in the job.

Steps to reproduce

  • Create a cluster with emr-7.13.0, Spark + Livy.
  • Add a bootstrap action: sudo pip3 install boto3 (I would have expected boto3 to be available on EMR by default).
  • Confirm install location (e.g. under python3.9 site-packages).
  • Submit a Spark / Livy job whose main script does import boto3.
  • Observe failure in YARN container logs: ModuleNotFoundError: No module named 'boto3'.
  • Repeat the same bootstrap + job on emr-7.12.x and observe success.
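To see which interpreter a job really runs under, a small probe job can be submitted in place of the failing script. The following is a minimal sketch, not an official diagnostic: the names probe and run_on_cluster are hypothetical, and it assumes PySpark is importable wherever run_on_cluster is called.

```python
import importlib.util
import sys


def probe(module_name="boto3"):
    """Describe the interpreter running this call and whether it can see module_name."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "version": "%d.%d" % sys.version_info[:2],
        "module": module_name,
        # None means the module is not on this interpreter's sys.path.
        "found_at": spec.origin if spec is not None else None,
    }


def run_on_cluster(module_name="boto3", partitions=4):
    """Collect probe() results from the driver and from executor tasks."""
    # Imported lazily so probe() can also be run on a node without PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("python-version-probe").getOrCreate()
    try:
        driver = probe(module_name)
        executors = (
            spark.sparkContext
            .parallelize(range(partitions), partitions)
            .map(lambda _: probe(module_name))
            .collect()
        )
    finally:
        spark.stop()
    return driver, executors
```

Calling run_on_cluster() from the submitted script and printing both results shows the interpreter path, its version, and whether boto3 resolves on each side. If the driver and executors run different minor versions, PySpark normally refuses to run the tasks and says so in the error, which is itself a useful signal.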

Evidence (from our investigation)

  • Bootstrap installs resolved to: /usr/local/lib/python3.9/site-packages.
  • Worker/runtime Python for the job: 3.11 (mismatch with install target).
  • Aligning EMR version to 7.12 avoids the mismatch with our current bootstrap.

Temporary mitigation (customer-side)

  1. Short term: Use emr-7.12.x for clusters where the bootstrap action uses pip3 and PySpark imports those packages (verified as a workaround in our environment).
  2. Alternative (if staying on 7.13): Install packages with the same interpreter Spark uses, e.g. sudo /usr/bin/python3.11 -m pip install ... (the exact path should be confirmed per AMI), and/or set PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON explicitly to match.

We consider (2) fragile without official guidance per release, because paths and defaults can change between AMI updates.
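One way to make the bootstrap less dependent on assumptions is to ask pip which Python it installs for before using it. The sketch below parses the usual one-line output of pip --version; the helper names and the sample version string in the comment are illustrative assumptions, and the output format may differ on unusual pip builds.

```python
import re
import subprocess

# Matches the usual `pip --version` line, for example (illustrative values):
#   pip 21.3.1 from /usr/lib/python3.9/site-packages/pip (python 3.9)
_PIP_VERSION_RE = re.compile(
    r"^pip \S+ from (?P<location>.+) \(python (?P<python>\d+\.\d+)\)\s*$"
)


def parse_pip_version(output):
    """Return (install_location, python_version) from `pip --version` output."""
    match = _PIP_VERSION_RE.match(output.strip())
    if match is None:
        raise ValueError("unrecognised pip --version output: %r" % output)
    return match.group("location"), match.group("python")


def pip_target(pip_command=("pip3",)):
    """Run `<pip_command> --version` and report which Python it installs for."""
    result = subprocess.run(
        list(pip_command) + ["--version"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_pip_version(result.stdout)
```

For example, pip_target(("pip3",)) and pip_target(("/usr/bin/python3.11", "-m", "pip")) can be compared on a cluster node. If the first reports 3.9 while the job runs on 3.11, the mismatch described above is confirmed without submitting a job.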

asked a month ago, 155 views
2 Answers

As far as I understand your issue, the configuration keys mentioned (e.g., spark.emr-serverless...) are specific to EMR Serverless. Since you are using EMR on EC2 (with Bootstrap Actions and YARN), those specific parameters won't apply to your setup.

To fix this properly on EMR 7.13.0, you should combine the explicit pip install with a cluster configuration to ensure Spark consistently uses the correct interpreter.

Recommended Configuration for EMR 7.13.0 (EC2)

Add this to your EMR cluster configuration (JSON) to explicitly bind PySpark to the version where you installed your packages:

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3.11",
          "PYSPARK_DRIVER_PYTHON": "/usr/bin/python3.11"
        }
      }
    ]
  }
]
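If the interpreter path has to change between releases, generating this classification from a single variable keeps it in step with the bootstrap command. This is a minimal sketch; the function name spark_env_classification is hypothetical, and /usr/bin/python3.11 is the path quoted in this thread, which should be confirmed on the actual AMI.

```python
import json


def spark_env_classification(python_path):
    """Build the spark-env/export classification for one interpreter path."""
    return [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {
                        "PYSPARK_PYTHON": python_path,
                        "PYSPARK_DRIVER_PYTHON": python_path,
                    },
                }
            ],
        }
    ]


# Path taken from this thread; verify it exists on your nodes before relying on it.
print(json.dumps(spark_env_classification("/usr/bin/python3.11"), indent=2))
```

The printed JSON can be saved to a file and supplied as the cluster configuration.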

Updated Bootstrap Action

Ensure your bootstrap script targets the specific binary as suggested in the "Alternative" section:

sudo /usr/bin/python3.11 -m pip install boto3 <your-other-packages>

For me, this approach is more stable than relying on the pip3 alias, which may point to older system Python versions (like 3.9) for OS compatibility reasons.

EXPERT
answered a month ago

The issue you're experiencing stems from a Python version mismatch between your bootstrap action and the runtime environment in EMR 7.13.0. When you use pip3 install in your bootstrap action, it's installing packages to Python 3.9's site-packages, but your Spark/YARN workloads are running with Python 3.11, which has a separate site-packages directory.

To resolve this issue, you need to ensure that packages are installed for the same Python version that Spark will use at runtime. Here are the recommended approaches:

Recommended Solution: Use Python Virtual Environments

The most robust approach is to create a Python virtual environment with your desired Python version and dependencies, then configure Spark to use it. This method ensures consistency across your cluster:

  1. Build a virtual environment with Python 3.11 and your required packages
  2. Package the virtual environment as a tar.gz archive
  3. Upload the archive to Amazon S3
  4. Configure your Spark job to use the virtual environment by setting:
       • spark.archives to point to your S3 archive
       • spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON to the Python binary in the archive
       • spark.emr-serverless.driverEnv.PYSPARK_PYTHON to the Python binary in the archive
       • spark.executorEnv.PYSPARK_PYTHON to the Python binary in the archive
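For EMR on EC2 with YARN, which is the setup in the question, a rough equivalent of step 4 can be sketched as a spark-submit command line. This assumes cluster deploy mode, the standard Spark properties spark.archives, spark.yarn.appMasterEnv.* and spark.executorEnv.*, and an archive that unpacks to a usable bin/python. The function name, the alias "environment", and the S3 URI in the usage note are hypothetical.

```python
def venv_submit_args(archive_uri, script, alias="environment"):
    """Assemble spark-submit arguments that point PySpark at a shipped virtualenv."""
    # The archive is unpacked in each container's working directory under `alias`.
    python_in_archive = "./%s/bin/python" % alias
    return [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--conf", "spark.archives=%s#%s" % (archive_uri, alias),
        # In cluster mode the driver runs inside the YARN application master.
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=" + python_in_archive,
        "--conf", "spark.executorEnv.PYSPARK_PYTHON=" + python_in_archive,
        script,
    ]
```

For instance, venv_submit_args("s3://amzn-s3-demo-bucket/envs/pyspark_venv.tar.gz", "job.py") returns a list that can be passed to subprocess.run or joined into a shell command. Whether Livy sessions honour the same properties should be checked separately.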

Alternative: Target the Correct Python Version in Bootstrap

If you prefer to continue using bootstrap actions, explicitly target Python 3.11 when installing packages:

sudo /usr/bin/python3.11 -m pip install boto3

This ensures packages are installed to the Python 3.11 site-packages directory that Spark will use at runtime.
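A bootstrap action can also verify the result rather than assume it. The sketch below asks a given interpreter to import a module in a child process and reports whether that succeeded; the helper name can_import is hypothetical, and the interpreter path passed in is whatever your AMI actually provides.

```python
import subprocess
import sys


def can_import(interpreter, module_name):
    """Return True if `interpreter -c "import module_name"` exits successfully."""
    result = subprocess.run(
        [interpreter, "-c", "import " + module_name],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Self-check against the interpreter running this script.
print(can_import(sys.executable, "json"))
```

On a cluster node, can_import("/usr/bin/python3.11", "boto3") after the install step shows whether the package landed where Spark will look. A missing interpreter path raises FileNotFoundError, which is also worth surfacing as a bootstrap failure.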

Additional Consideration

You may also want to explicitly set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables in your Spark configuration to ensure consistency across driver and executor processes.

The virtual environment approach is generally more maintainable and portable across EMR versions, as it doesn't rely on assumptions about the default Python version on the AMI.

Sources
Using different Python versions with EMR Serverless - Amazon EMR

answered a month ago
EXPERT
reviewed a month ago
