I want to install and troubleshoot Python libraries in Amazon EMR and Amazon EMR Serverless clusters.
Resolution
Install Python libraries in Amazon EMR clusters
To install python libraries in Amazon EMR clusters, use a bootstrap action.
Amazon EMR uses puppet, an Apache BigTop deployment mechanism, to configure and initialize applications on instances. Instance-controller is an Amazon EMR software component that runs on every cluster instance. Instance-controller initializes, and then provisions instances based on the instance configuration.
To start NodeProvisioner at the cluster startup, the instance controller runs the provision node script /usr/share/aws/emr/node-provisioner/bin/provision-node. Then, NodeProvisioner provisions all of the Amazon EMR distribution applications for the node and cluster configuration. NodeProvisioner is a final bootstrap action that runs after all other bootstrap actions run on each cluster node.
For the latest Amazon EMR clusters, bootstrap actions run before Amazon EMR installs applications that are specified at the cluster creation. Also, the bootstrap action runs before cluster nodes process data. If you add nodes to a running cluster, then bootstrap actions run on those nodes. You can create custom bootstrap actions and specify applications to install when you create your cluster.
Install Python libraries in Amazon EMR Serverless clusters
To install Python libraries and use their capabilities within your Spark jobs and notebooks, use one of the following methods based on your use case:
Troubleshoot Python libraries
Python libraries that are installed by bootstrap actions might be overrode by Amazon EMR default libraries. To resolve this issue, create a delayed bootstrap actions or a second stage bootstrap action as a running code. Or, install the packages after you receive the NODEPROVISIONSTATE SUCCESSFUL message.
The following bootstrap action upgrades the library after the application provisioning stage. Add this script as a bootstrap script that runs in the background and exits so that cluster provisioning continues. This script continues to monitor node provisioning and upgrades the library after provisioning.
Example script that upgrades the NumPy version:
#!/bin/bash
set -x
cat > /var/tmp/fix-bootstap.sh <<'EOF'
#!/bin/bash
set -x
while true; do
NODEPROVISIONSTATE=`sed -n '/localInstance [{]/,/[}]/{
/nodeProvisionCheckinRecord [{]/,/[}]/ {
/status: / { p }
/[}]/a
}
/[}]/a
}' /emr/instance-controller/lib/info/job-flow-state.txt | awk ' { print $2 }'`
if [ "$NODEPROVISIONSTATE" == "SUCCESSFUL" ]; then
echo "Running my post provision bootstrap"
# your code here
sudo /mnt/notebook-env/bin/pip install pandas==1.3.5
sudo /mnt/notebook-env/bin/pip install boto==2.49.0
sudo /mnt/notebook-env/bin/pip install boto3==1.25.0
exit
else
echo "Sleeping Till Node is Provisioned"
sleep 10
fi
done
EOF
chmod +x /var/tmp/fix-bootstap.sh
nohup /var/tmp/fix-bootstap.sh 2>&1 &
Note: YARN containers that run a Python package might not use an updated package that can be installed with the preceding resolution. As a result, you'll receive module not found errors when you attempt to install an updated package. To prevent module not found errors, poll the nodemanager service state. Then, run the desired bootstrap action when the nodemanager starts.