How can I install Python libraries on my EMR clusters?

I want to install external Python libraries on my Amazon EMR clusters.

Short description

You can install Python libraries using a bootstrap action.

Amazon EMR uses Puppet, the deployment mechanism used by Apache Bigtop, to configure and initialize applications on instances. Instance-controller is the Amazon EMR software component that runs on every instance of the cluster. Instance-controller initializes and then provisions instances based on the instance configuration.

The instance-controller runs the provision-node script at /usr/share/aws/emr/node-provisioner/bin/provision-node to start NodeProvisioner at cluster startup. NodeProvisioner provisions all of the EMR distribution's applications for the node and cluster configuration. NodeProvisioner is treated as a final bootstrap action that runs after all other bootstrap actions are run on each node of the cluster.


In the latest EMR clusters, bootstrap actions run before Amazon EMR installs any applications specified at cluster creation. The bootstrap action runs before cluster nodes begin processing data. If you add nodes to a running cluster, then bootstrap actions also run on those nodes in the same way. You can create custom bootstrap actions and specify applications to install when you create your cluster. For more information, see Create bootstrap actions to install additional software.
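
As a minimal sketch of such a custom bootstrap action, the following script installs the packages that are passed to it as arguments. The script name install-python-libs.sh and the DRY_RUN switch are hypothetical and only for illustration; DRY_RUN is not an Amazon EMR feature.

```shell
#!/bin/bash
# Hypothetical bootstrap script: install-python-libs.sh
# Package names (for example numpy==1.20.1) are passed as bootstrap action arguments.

# Install the given packages with pip.
# With DRY_RUN=1 (illustrative switch), only print the command instead of running it.
install_libs() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sudo python3 -m pip install $*"
    else
        sudo python3 -m pip install "$@"
    fi
}

# Install only when at least one package name was passed.
if [ "$#" -gt 0 ]; then
    install_libs "$@"
fi
```

After you upload the script to Amazon S3, you can reference it when you create the cluster, for example with --bootstrap-actions Path=s3://amzn-s3-demo-bucket/install-python-libs.sh,Args=[numpy==1.20.1] in the aws emr create-cluster command. The bucket name is a placeholder.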

Troubleshoot libraries installed by bootstrap actions that are overridden by default libraries

Libraries installed using bootstrap actions might be overridden by Amazon EMR default libraries. This happens because the bootstrap script runs before node provisioning, which is when Amazon EMR installs the cluster's applications. The applications installed afterward can replace your library with their default version.

To avoid this issue, create a delayed bootstrap action, or a second-stage bootstrap action that runs in the background. Or, install packages only after the node reports NODEPROVISIONSTATE as SUCCESSFUL.

The following bootstrap script upgrades the library after the application provisioning stage. You can add this script as a bootstrap script that runs in the background and exits successfully so that cluster provisioning continues. This script continues to monitor node provisioning and upgrades the library after provisioning.

The following example script upgrades the NumPy version:

#!/bin/bash
# Poll the node provisioning status until it reports SUCCESSFUL.
while true; do
    # Extract "status: <value>" from localInstance > nodeProvisionCheckinRecord.
    NODEPROVISIONSTATE=$(sed -n '/localInstance [{]/,/[}]/{
        /nodeProvisionCheckinRecord [{]/,/[}]/{
            /status: /p
        }
    }' /emr/instance-controller/lib/info/job-flow-state.txt | awk '{ print $2 }')
    if [ "$NODEPROVISIONSTATE" == "SUCCESSFUL" ]; then
        sleep 10
        echo "Running my post provision bootstrap"
        # your code here
        # Example lines (-y skips pip's confirmation prompt):
        # sudo python3 -m pip uninstall -y numpy    (removes the default version, 1.16.5)
        # sudo python3 -m pip install --upgrade numpy==1.20.1    (new version of numpy)
        exit 0
    fi
    sleep 10
done
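
One way to run the preceding script in the background is a small launcher that starts it detached and returns immediately, so that cluster provisioning continues. The following is a sketch under assumptions: the path /home/hadoop/post-provision.sh, the log file location, and the function name launch_in_background are hypothetical, and the polling script must already exist on the node (for example, copied from Amazon S3 earlier in the bootstrap action).

```shell
#!/bin/bash
# Start a script detached from the bootstrap action so that the action can return right away.
# $1: path to the polling script, $2: log file for its output
launch_in_background() {
    nohup bash "$1" > "$2" 2>&1 &
    echo "$!"    # PID of the background job
}

# Hypothetical location of the polling script; override with the POLL_SCRIPT variable.
POLL_SCRIPT="${POLL_SCRIPT:-/home/hadoop/post-provision.sh}"

# Launch only if the polling script is present on this node.
if [ -f "$POLL_SCRIPT" ]; then
    launch_in_background "$POLL_SCRIPT" /tmp/post-provision.log
fi
```

Because the launcher doesn't wait for the background job, the bootstrap action finishes successfully while the polling loop keeps running until provisioning completes.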

Note: In some cases, YARN containers that run a Python package might not pick up a package that was updated using the preceding resolution. If a container doesn't see the updated package, you get module not found errors when your application imports it. This is because the YARN NodeManager process launches containers, and containers might already be running or allocated before NODEPROVISIONSTATE is SUCCESSFUL. This issue is often seen in multi-tenant clusters that have frequent auto scaling.

You can avoid module not found errors by polling the state of the NodeManager service. Then, run the desired bootstrap action as soon as NodeManager starts.
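
The polling approach might look like the following sketch. It assumes that NodeManager runs as the systemd unit hadoop-yarn-nodemanager, which you should verify on your Amazon EMR release. The function names, the --run argument, and the retry values are illustrative, and the NumPy upgrade reuses the example version from the preceding script.

```shell
#!/bin/bash
# Retry a check command until it succeeds or the attempts run out.
# $1: maximum attempts, $2: seconds between attempts, remaining arguments: the check command
wait_until() {
    wu_attempts="$1"
    wu_delay="$2"
    shift 2
    wu_i=0
    while [ "$wu_i" -lt "$wu_attempts" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        wu_i=$((wu_i + 1))
        sleep "$wu_delay"
    done
    return 1
}

# Assumption: NodeManager runs as the systemd unit hadoop-yarn-nodemanager.
nodemanager_running() {
    systemctl is-active --quiet hadoop-yarn-nodemanager
}

# Run only when called with --run, so that the functions can also be reused on their own.
if [ "${1:-}" = "--run" ]; then
    # Wait up to about 15 minutes (180 attempts, 5 seconds apart).
    if wait_until 180 5 nodemanager_running; then
        sudo python3 -m pip install --upgrade numpy==1.20.1
    fi
fi
```

Like the provisioning loop, this script is meant to run in the background so that it doesn't block node provisioning while it waits.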

AWS OFFICIAL | Updated a year ago