The sagemaker-model-monitor-analyzer image in a processing job can’t support more than 4 instances

Background

Our team uses SageMaker Processing jobs for data baselining, that is, to compute statistics over a large training dataset. To handle that scale, we want to increase InstanceCount in our processing job.
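For reference, InstanceCount lives under ProcessingResources.ClusterConfig in a CreateProcessingJob request. The following is a minimal sketch of that section only; the helper name build_processing_resources and the default instance type and volume size are placeholders for illustration, not values from our setup.

```python
def build_processing_resources(instance_count, instance_type="ml.m5.xlarge", volume_size_gb=30):
    """Build the ProcessingResources section of a CreateProcessingJob request.

    instance_type and volume_size_gb are illustrative defaults (assumptions),
    not the values used in the job described in this question.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "ClusterConfig": {
            "InstanceCount": instance_count,   # the setting we want to raise
            "InstanceType": instance_type,
            "VolumeSizeInGB": volume_size_gb,
        }
    }


# Example: the largest cluster size that currently works for us.
resources = build_processing_resources(instance_count=4)
print(resources)
```

The resulting dict can be passed as the ProcessingResources argument of boto3's create_processing_job, alongside the other required fields.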

Issue

If we specify more than one instance, the SageMaker Model Monitor analyzer image "{{ sagemaker_analyzer_mapping[deployment]['account_id'] }}.dkr.ecr.{{ sagemaker_analyzer_mapping[deployment]['region'] }}.amazonaws.com/sagemaker-model-monitor-analyzer" does not seem to work.

We configured inter-container communication using /opt/ml/input/config/resourceconfig.json, which let us increase the instance count from 1 to 4. However, the processing jobs still fail when the instance count is 5 or more. Logs from CloudWatch:

2023-09-19 17:44:26,996 - DefaultDataAnalyzer - INFO - Running command: /usr/hadoop-3.0.0/bin/hdfs dfs -get /sagemaker/end_of_job local_eoj

get: Call From algo-9/10.2.118.69 to algo-1:9820 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
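To help tell a configuration problem apart from a timing problem (for example, algo-1 not yet listening when the other hosts start), a small diagnostic along these lines could be run inside the container. This is only a sketch under assumptions: it assumes resourceconfig.json contains the "current_host" and "hosts" keys, and that algo-1 on port 9820 is the endpoint to probe, as the log above suggests. The function names are made up for illustration and are not part of the analyzer image.

```python
import json
import socket
import time


def load_hosts(config_path):
    """Return (current_host, hosts) from a resourceconfig.json file.

    Assumes the file has "current_host" and "hosts" keys.
    """
    with open(config_path) as f:
        config = json.load(f)
    return config["current_host"], config["hosts"]


def wait_for_port(host, port, attempts=10, delay_seconds=5.0, timeout_seconds=3.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout_seconds):
                return True
        except OSError:
            # Connection refused, timeout, or name resolution failure: retry.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    return False


def diagnose(config_path, namenode_host="algo-1", namenode_port=9820):
    """Print the cluster view of this host and whether the probed port is reachable.

    namenode_host and namenode_port are taken from the log line above
    (algo-1:9820); they are assumptions, not documented settings.
    """
    current_host, hosts = load_hosts(config_path)
    print(f"current_host={current_host}, cluster size={len(hosts)}")
    reachable = wait_for_port(namenode_host, namenode_port)
    print(f"{namenode_host}:{namenode_port} reachable: {reachable}")
    return reachable
```

Calling diagnose("/opt/ml/input/config/resourceconfig.json") on a failing host such as algo-9 would show whether the port ever becomes reachable within the retry window, or whether it stays refused for the whole run.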

I wonder whether this is because the sagemaker-model-monitor-analyzer image can’t run on more than 4 instances. If so, what is the best way to address this issue?

asked 6 months ago, 242 views
No Answers