Installing python packages to the SageMaker PySparkProcessor (multi-instance setup)


Hi,

I am exploring the AWS SageMaker PySparkProcessor for data preprocessing (see here). We are in an NLP scenario where we aim to process large quantities of text in a distributed fashion. Locally, on a smaller test input, our PySpark script runs fine. On the cluster, however, we need to install additional dependencies and models.
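For context, we launch the job roughly as below; the job name, script name, role ARN, S3 paths and instance settings are illustrative placeholders, not our real values:

    from sagemaker.spark.processing import PySparkProcessor

    spark_processor = PySparkProcessor(
        base_job_name="nlp-preprocessing",  # placeholder job name
        framework_version="3.1",            # assumed Spark version
        role="arn:aws:iam::123456789012:role/SageMakerProcessingRole",  # placeholder
        instance_count=4,                   # multi-instance setup
        instance_type="ml.m5.xlarge",
    )

    spark_processor.run(
        submit_app="preprocess.py",  # our PySpark script (placeholder name)
        arguments=["--input", "s3://my-bucket/raw", "--output", "s3://my-bucket/clean"],
    )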

Questions:

  1. Is there a reliable way to install Python packages on all nodes of a cluster?
  2. Is there a way to run initializations such as nltk.download() once per cluster node?

What I tried: I read a variety of documentation and blog posts without finding a reliable way to install Python packages:

  1. Using shell execution: the ugly fix of os.system("python3 -m pip install <some_lib>") works in the single-node setting. However, as soon as the cluster manager distributes work to other instances, it stops working (a best-effort sketch of pushing the install onto every executor follows this list).

  2. Trying to access sc without an import and calling install_packages: I assumed there was an AWS EMR container under the hood and hoped to install packages this way. However, sc was not set/defined in my submitted script.

  3. Trying to access the SparkContext via an import and calling install_packages: the install_packages method is not plain PySpark but an AWS EMR notebook feature, so it does not exist here.

  4. Using the Spark config to set up a virtualenv from a requirements file (see https://gist.github.com/geosmart/3b345b4b335658fb05bb25f934a30723): this does not work because I cannot figure out whether, and at which location, the virtualenv binary is installed inside the container. Booting therefore fails because the cluster does not find the binary.
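For reference, here is roughly what I mean by pushing the install onto every executor (attempt 1), which would also cover the per-node nltk.download() from question 2. The package names are only examples, and the trick relies on Spark scheduling at least one task per worker, which is not guaranteed, so treat it as best-effort rather than reliable:

    import subprocess
    import sys

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    def install_on_executor(_):
        # Runs inside each task: install into the executor's user site-packages.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", "nltk"]
        )
        import nltk
        nltk.download("punkt")  # example of a one-time, per-node initialization
        yield None

    # One partition per default slot, hoping tasks land on every worker.
    num_slots = sc.defaultParallelism
    rdd = sc.parallelize(range(num_slots), num_slots)
    rdd.mapPartitions(install_on_executor).collect()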

So, I've slowly been running out of ideas. But there must be a way to do proper initialization (and hopefully one that isn't too complicated).

tsteuer
asked 9 months ago · 797 views
1 Answer

Hello,

Thank you for using AWS SageMaker!

From the description, I see that you are using the PySparkProcessor for data preprocessing with SageMaker, and the requirement is to install custom packages on all nodes in a multi-instance setup. Unfortunately, at the moment it is not possible to install custom packages on the containers provided by SageMaker, as it is a managed service.

However, if you want to extend or add custom packages, you can bring your own container. Please refer to the links below on building your own processing container (a short usage sketch follows the links):

[+] Build Your Own Processing Container (Advanced Scenario) - https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html

[+] Build Your Own Processing Container (Advanced Scenario) - Run Your Processing Container Using the SageMaker Python SDK - https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html#byoc-run
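For reference, once you have built such an image and pushed it to Amazon ECR, the SageMaker Python SDK lets you point the processor at it through its image_uri parameter. A rough sketch, where the account ID, region, repository, role and script name are all placeholders:

    from sagemaker.spark.processing import PySparkProcessor

    processor = PySparkProcessor(
        base_job_name="nlp-preprocessing-byoc",  # placeholder job name
        image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/custom-spark:latest",  # placeholder
        role="arn:aws:iam::123456789012:role/SageMakerProcessingRole",  # placeholder
        instance_count=4,
        instance_type="ml.m5.xlarge",
    )

    processor.run(submit_app="preprocess.py")  # placeholder script name

The custom image would typically extend the SageMaker Spark container and pip-install the extra packages at build time, so every node starts with them already available.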

If you experience any difficulty implementing the above solution, I would recommend reaching out to AWS Support [1] (SageMaker) with your issue/use case described in detail, along with the relevant AWS resource names. We will be more than happy to assist you.

Hope this helps!

[1] https://support.console.aws.amazon.com/

AWS
answered 9 months ago
