As far as I understand your issue, the configuration keys mentioned (e.g., spark.emr-serverless...) are specific to EMR Serverless. Since you are using EMR on EC2 (with Bootstrap Actions and YARN), those specific parameters won't apply to your setup.
To fix this properly on EMR 7.13.0, you should combine the explicit pip install with a cluster configuration to ensure Spark consistently uses the correct interpreter.
Recommended Configuration for EMR 7.13.0 (EC2)
Add this to your EMR cluster configuration (JSON) to explicitly bind PySpark to the version where you installed your packages:
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3.11",
          "PYSPARK_DRIVER_PYTHON": "/usr/bin/python3.11"
        }
      }
    ]
  }
]
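If you create clusters from a script rather than the console, the same structure can be generated in Python. This is only a sketch: the helper name spark_env_configuration is made up for illustration, and the interpreter path is the one from the JSON above, so adjust it if your AMI differs.

```python
import json


def spark_env_configuration(python_path="/usr/bin/python3.11"):
    """Build the spark-env classification that exports the PySpark interpreter.

    `python_path` defaults to the path used in the JSON above; it is an
    assumption about where Python 3.11 lives on the cluster nodes.
    """
    return [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {
                        "PYSPARK_PYTHON": python_path,
                        "PYSPARK_DRIVER_PYTHON": python_path,
                    },
                }
            ],
        }
    ]


if __name__ == "__main__":
    # Print the JSON so it can be pasted into the cluster configuration.
    print(json.dumps(spark_env_configuration(), indent=2))
```

The returned list has the same shape as the JSON document, so it should be usable wherever the cluster configuration is accepted as a list of classifications.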
Updated Bootstrap Action
Ensure your bootstrap script targets the specific binary as suggested in the "Alternative" section:
sudo /usr/bin/python3.11 -m pip install boto3 <your-other-packages>
In my experience, this is more reliable than relying on the pip3 alias, which may resolve to an older system Python (such as 3.9) that is kept for OS compatibility reasons.
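To confirm which interpreter actually has your packages, a small diagnostic like the one below may help. It is a minimal sketch using only the standard library. The function name report_interpreter and the file name check_env.py are hypothetical, and boto3 is just the package from the example above.

```python
import importlib.util
import sys
import sysconfig


def report_interpreter(packages):
    """Describe the running interpreter and which top-level packages it can import."""
    return {
        "executable": sys.executable,
        "version": "{}.{}".format(*sys.version_info[:2]),
        # Directory where pip installs pure-Python packages for this interpreter.
        "site_packages": sysconfig.get_paths()["purelib"],
        "importable": {
            name: importlib.util.find_spec(name) is not None for name in packages
        },
    }


if __name__ == "__main__":
    # Replace the list with the packages installed by your bootstrap action.
    for key, value in report_interpreter(["boto3"]).items():
        print(f"{key}: {value}")
```

Saving this as check_env.py and running it once with /usr/bin/python3.11 and once with python3 on a cluster node should show whether the two resolve to different interpreters and site-packages directories.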
The issue you're experiencing stems from a Python version mismatch between your bootstrap action and the runtime environment in EMR 7.13.0. When you use pip3 install in your bootstrap action, it's installing packages to Python 3.9's site-packages, but your Spark/YARN workloads are running with Python 3.11, which has a separate site-packages directory.
To resolve this issue, you need to ensure that packages are installed for the same Python version that Spark will use at runtime. Here are the recommended approaches:
Recommended Solution: Use Python Virtual Environments
The most robust approach is to create a Python virtual environment with your desired Python version and dependencies, then configure Spark to use it. This method ensures consistency across your cluster:
- Build a virtual environment with Python 3.11 and your required packages
- Package the virtual environment as a tar.gz archive
- Upload the archive to Amazon S3
- Configure your Spark job to use the virtual environment by setting:
  - spark.archives to point to your S3 archive
  - spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON to the Python binary in the archive
  - spark.emr-serverless.driverEnv.PYSPARK_PYTHON to the Python binary in the archive
  - spark.executorEnv.PYSPARK_PYTHON to the Python binary in the archive
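As a rough illustration, the four settings above could be assembled into --conf arguments like this. Treat it as a sketch under assumptions: the helper names, the example bucket, the "environment" alias after the # and the ./environment/bin/python path are placeholders, not values from your setup. Note also that the spark.emr-serverless.* keys are the EMR Serverless ones, so they may have no effect on EMR on EC2.

```python
def venv_spark_conf(archive_uri, alias="environment"):
    """Map the Spark properties listed above to values for a packed virtual environment.

    `archive_uri` is the S3 location of the tar.gz archive. `alias` is the
    directory name the archive is unpacked under (an assumed convention).
    """
    python_bin = f"./{alias}/bin/python"
    return {
        "spark.archives": f"{archive_uri}#{alias}",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": python_bin,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": python_bin,
        "spark.executorEnv.PYSPARK_PYTHON": python_bin,
    }


def to_submit_args(conf):
    """Flatten a property dict into a list of --conf key=value arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical bucket and archive name, for illustration only.
    conf = venv_spark_conf("s3://example-bucket/pyspark_venv.tar.gz")
    print(" ".join(to_submit_args(conf)))
```

Keeping the properties in a dict makes it easy to swap individual keys if your deployment type expects different ones.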
Alternative: Target the Correct Python Version in Bootstrap
If you prefer to continue using bootstrap actions, explicitly target Python 3.11 when installing packages:
sudo /usr/bin/python3.11 -m pip install boto3
This ensures packages are installed to the Python 3.11 site-packages directory that Spark will use at runtime.
Additional Consideration
You may also want to explicitly set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables in your Spark configuration to ensure consistency across driver and executor processes.
The virtual environment approach is generally more maintainable and portable across EMR versions, as it doesn't rely on assumptions about the default Python version on the AMI.
Sources
Using different Python versions with EMR Serverless - Amazon EMR
