Running a multi-node parallel job in AWS Batch using R


Hi, I am trying to build a multi-node parallel job in AWS Batch that runs an R script. My R script independently runs a few statistical models for each of several users, so I want to split this job and distribute it across a cluster of several servers for faster execution. My question is about the architecture. My understanding is that at some point I have to prepare a containerized version of my R application code as a Docker image pushed to ECR. My question is:

Should the parallel logic be placed inside the R code, while using the same image? If so, how does Batch know how to split my job (into how many chunks)? Is a for-loop in the R code enough? Or should I define the parallel logic somewhere in the Dockerfile, e.g. container1 runs the models for users 1-5, container2 runs the models for users 6-10, and so on?

Could you please share some ideas on this topic to help me understand it better? Much appreciated.

1 Answer
Accepted Answer

Running a multi-node parallel job in AWS Batch using R can be achieved by using a combination of R's parallel processing capabilities and AWS Batch's built-in support for distributed computing.

To containerize your R application code, you can use a Docker image and push it to Amazon Elastic Container Registry (ECR). Once the image is available in ECR, you can use it to create a job definition in AWS Batch.

As for the parallel logic, it can be placed inside the R code. You can use R's parallel processing packages, such as the built-in 'parallel' package or the 'foreach' package, to split your job into chunks and run them in parallel. For example, you can use 'foreach' to define a parallel loop that runs the models for different users concurrently. In that case, you don't need to define any parallel logic in the Dockerfile.
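
For illustration, here is a minimal sketch of that pattern. The users vector and the run_models_for_user() function are placeholders for your own data and modelling code:

    library(doParallel)   # attaches foreach and parallel as well

    users <- c("user1", "user2", "user3", "user4", "user5")   # placeholder user IDs

    cl <- parallel::makeCluster(parallel::detectCores(), type = "PSOCK")
    doParallel::registerDoParallel(cl)

    # each iteration fits the models for one user; iterations run on separate workers
    results <- foreach(u = users, .packages = c("stats")) %dopar% {
      run_models_for_user(u)   # placeholder for your per-user modelling function
    }

    parallel::stopCluster(cl)

The foreach call returns a list with one element per user, so the results are already collected in a single object when the loop finishes.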

To define how many chunks your job will be split into, you can use the 'registerDoParallel' function to specify the number of parallel workers. You can also set the number of vCPUs allocated for the container in the job definition.
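
If you prefer to control the worker count explicitly (for example, to match the vCPUs you request in the job definition), you can pass it in yourself. A small sketch, assuming a custom environment variable NUM_WORKERS that you would set yourself in the job definition (Batch does not provide this variable):

    # NUM_WORKERS is a custom variable set in the job definition (assumption); default to 4
    n_workers <- as.integer(Sys.getenv("NUM_WORKERS", unset = "4"))
    doParallel::registerDoParallel(cores = n_workers)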

AWS Batch will automatically scale the number of instances based on the number of jobs that are waiting to be executed and the number of available instances. You can also configure the number of instances in the compute environment.

In summary, you can place the parallel logic inside the R code, and use the same image to create the job definition. AWS Batch will automatically split the job into chunks based on the parallel workers specified in the R code and the number of vCPUs allocated for the container.
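
If you later want explicit control over which node handles which users in a true multi-node parallel job, note that Batch exposes environment variables such as AWS_BATCH_JOB_NODE_INDEX and AWS_BATCH_JOB_NUM_NODES inside each container. A rough sketch of using them to slice the work, where all_users is a placeholder for your full user list:

    # identify this node within the multi-node parallel job
    node_index <- as.integer(Sys.getenv("AWS_BATCH_JOB_NODE_INDEX", unset = "0"))
    num_nodes  <- as.integer(Sys.getenv("AWS_BATCH_JOB_NUM_NODES", unset = "1"))

    # keep only the users assigned to this node (round-robin split)
    my_users <- all_users[seq_along(all_users) %% num_nodes == node_index]

Each node then runs the same foreach loop, but only over my_users.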

answered a year ago
reviewed by an EXPERT 10 months ago
  • Hi Victor and thank you very much for your reply.

    This is exactly how I have written my R code. I am using:

    library(foreach)  # makes the %dopar% operator available
    cores <- parallel::detectCores()
    workers <- parallel::makeCluster(cores, type = "PSOCK")
    doParallel::registerDoParallel(workers)
    output <- foreach(i = seq_along(tasks),   # loop over the independent tasks in parallel
                      .packages = c("library1", "library2")) %dopar% {  # list every package the loop body needs
      # ... logic for each independent task ...
    }
    

    The part that I was missing was "how Batch knows exactly how to chunk/distribute the jobs across different instances". Based on your answer, it must be that Batch is clever enough to work this out, scale the instances, and distribute tasks based on the foreach queue created and the EC2 resources/containers you have specified in the job definition.
    I will try to run this R code with the same image, with multiple EC2 instances defined in the submission, and fingers crossed this will do the work. I will leave an update.

  • Just an update: the Batch multi-node parallel process works fine in R using the logic and code described above. It also looks like Batch is clever enough to merge the nodes' separate results into a single object after the job is done. Make sure that your Batch subnet, security group, VPC, placement group and IAM roles have been set up appropriately, and that your EC2 quota limits are high enough to support your compute requirements. Finally, set a large enough container memory (and CPU) value so that your nodes don't get killed unexpectedly (exit code 137).

  • I am also posting a Dockerfile template that helped me containerize my R work for AWS Batch.

    FROM r-base:4.2.1
    # Set AWS credentials via ENV only if needed (Dockerfile comments must be on their own line)
    ENV AWS_ACCESS_KEY_ID (.....)
    ENV AWS_SECRET_ACCESS_KEY (.....)
    ENV AWS_DEFAULT_REGION (.....)
    # Update the package lists and install the system libraries the R packages need
    RUN apt-get update
    RUN apt-get install -y libcurl4-openssl-dev libxml2-dev libssl-dev libfontconfig1-dev
    RUN apt-get install -y libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev
    # Copy my packrat .lock file with all the libraries needed
    COPY packrat/packrat.lock ./packrat/packrat.lock
    RUN install2.r packrat
    # Install Java and configure R to use it (needed for Java-dependent R packages)
    RUN apt-get update && \
        apt-get install -y openjdk-8-jdk && \
        apt-get install -y ant && \
        apt-get clean
    ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
    RUN R CMD javareconf
    # Point R at the packrat library and restore the locked package versions
    RUN echo '.libPaths("./packrat/lib/x86_64-pc-linux-gnu/4.2.1")' >> /etc/R/Rprofile.site
    RUN Rscript -e 'packrat::restore()'
    COPY My_Pretty_Script.R .
    CMD ["Rscript", "My_Pretty_Script.R"]
    

    I hope people find this conversation useful.
