I want to use the Sparkmagic (PySpark) kernel on an Amazon SageMaker notebook instance. I used pip to install Python libraries, but I got the following error: "ModuleNotFoundError: No module named my_module_name."
Short description
When you use the Sparkmagic kernel, the SageMaker notebook acts as an interface for the Apache Spark session. The Apache Spark session runs on a remote Amazon EMR cluster or an AWS Glue development endpoint. When you use pip to install the Python library on the notebook instance, the library is available only to the local notebook instance. To resolve ModuleNotFoundError, install the library on the AWS Glue development endpoint or on each node of the EMR cluster.
Note: If the code that uses the library isn't compute-intensive, then use local mode (%%local). Local mode runs the cell only on the local notebook instance. When you use local mode, you don't need to install the library on the remote cluster or development endpoint.
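For example, the following cell runs only on the notebook instance, not on the remote cluster. This is a minimal sketch that assumes pandas is already installed locally on the notebook instance:
%%local
import pandas as pd
print(pd.__version__)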
Resolution
Install a library on an AWS Glue development endpoint
To install libraries on an AWS Glue development endpoint, see Loading Python libraries in a development endpoint.
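For example, if your library is packaged as an .egg or .whl file in Amazon S3, you can attach it to an existing development endpoint with the AWS CLI. The following command is a sketch; the endpoint name and S3 path are placeholders for your own values:
aws glue update-dev-endpoint \
    --endpoint-name example-endpoint \
    --custom-libraries '{"ExtraPythonLibsS3Path": "s3://example-bucket/libs/example_library.egg"}' \
    --update-etl-libraries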
Install a library on an Amazon EMR cluster
Note: The following commands use pandas as an example library. Replace pandas with the library that you want to use.
To install libraries on a remote Amazon EMR cluster, use a bootstrap action when you create the cluster, as in the sketch that follows this paragraph. If you already connected an Amazon EMR cluster to the SageMaker notebook instance, then manually install the library on all cluster nodes.
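A bootstrap action can point to a short shell script in Amazon S3 that installs the library on every node while the cluster starts. The following is a minimal sketch; the script name, bucket, and library are placeholders for your own values:
#!/bin/bash
# install_libs.sh: runs on every node during cluster creation
sudo python3 -m pip install pandas
Then reference the script when you create the cluster, for example with the AWS CLI:
aws emr create-cluster ... --bootstrap-actions Path=s3://example-bucket/install_libs.sh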
Complete the following steps:
1. Use SSH to connect to the primary node. For an example connection command, see the sketch after these steps.
2. Install the library:
sudo python -m pip install pandas
3. Confirm that the module is successfully installed:
python -c "import pandas as pd; print(pd.__version__)"
4. Open the Amazon SageMaker notebook instance, and then restart the kernel.
5. To confirm that the library works, run a command that requires the library, such as the following one:
pdf = spark.sql("show databases").toPandas()
6. Use SSH to connect to the other cluster nodes, and then install the library on each node.
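For example, the SSH command for step 1 looks similar to the following. The key pair file and public DNS name are placeholders for your own values, and the user name for Amazon EMR nodes is hadoop:
ssh -i ~/example-key-pair.pem hadoop@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com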
Use local mode
If you don't need to run the code on the remote cluster or development endpoint, then use the local notebook instance. For example, don't install matplotlib on each node of the Spark cluster. Instead, use local mode (%%local) to run the cell on the local notebook instance.
Note: In the following commands, replace the example variables with your variables.
To export results to a local variable and run the code in local mode, complete the following steps:
1. Export the result to a local variable:
%%sql -o query1
SELECT 1, 2, 3
2. Locally run the code:
%%local
print(len(query1))
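Because the -o option makes the query result available in the local Python session as a pandas DataFrame, you can also plot it locally. The following sketch assumes that matplotlib is installed on the notebook instance and reuses the query1 variable from the previous steps:
%%local
import matplotlib.pyplot as plt

query1.plot(kind='bar')  # query1 is a pandas DataFrame in local mode
plt.show()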
To use SageMakerEstimator in a Spark pipeline, run a local Spark session to modify the data. Then, use the SageMaker Spark library to train and make predictions. For more information, see sagemaker-spark on the AWS Labs GitHub repository.
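The following is a condensed sketch of that pattern, based on the pyspark_mnist_kmeans example and the sagemaker_pyspark library. The role ARN and the train_df and test_df DataFrames are placeholders, and class names and parameters can differ between library versions:
from pyspark.sql import SparkSession
import sagemaker_pyspark
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Put the SageMaker Spark JARs on the driver classpath
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", classpath)
         .getOrCreate())

# train_df and test_df: DataFrames with "label" and "features" columns,
# prepared with the local Spark session
estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::111122223333:role/ExampleRole"),  # placeholder
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1)
estimator.setK(10)
estimator.setFeatureDim(784)

model = estimator.fit(train_df)         # trains on SageMaker and deploys an endpoint
predictions = model.transform(test_df)  # calls the endpoint for predictions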
To view an example notebook, see the pyspark_mnist_kmeans notebook in the AWS Labs GitHub repository. The example notebook uses the conda_python3 kernel, which isn't backed by an EMR cluster. For jobs with heavy workloads, create a remote Spark cluster, and then connect the cluster to the notebook instance.
Related information
Use Apache Spark with Amazon SageMaker
Build Amazon SageMaker notebooks backed by Spark in Amazon EMR