Background
Our team uses SageMaker Processing jobs for data baselining, that is, computing statistics over a large training dataset. The goal of solving this issue is to be able to increase InstanceCount
in our processing job.
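For context, InstanceCount is the field under ProcessingResources.ClusterConfig in the CreateProcessingJob request. Below is a minimal sketch of that part of the request. The helper name build_processing_resources, the instance type, and the volume size are placeholders for illustration, not our actual settings.

```python
def build_processing_resources(instance_count, instance_type="ml.m5.xlarge", volume_size_gb=30):
    """Build the ProcessingResources argument for boto3's create_processing_job.

    instance_type and volume_size_gb defaults are illustrative placeholders.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "ClusterConfig": {
            "InstanceCount": instance_count,  # the value we want to raise above 4
            "InstanceType": instance_type,
            "VolumeSizeInGB": volume_size_gb,
        }
    }


resources = build_processing_resources(instance_count=5)
print(resources["ClusterConfig"]["InstanceCount"])
```

The resulting dict would be passed as ProcessingResources=resources to the SageMaker client's create_processing_job call, together with the rest of the job definition.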
Issue
If we specify more than one instance, the SageMaker model monitor analyzer image
"{{ sagemaker_analyzer_mapping[deployment]['account_id'] }}.dkr.ecr.{{ sagemaker_analyzer_mapping[deployment]['region'] }}.amazonaws.com/sagemaker-model-monitor-analyzer"
doesn't seem to work.
We configured inter-container communication using /opt/ml/input/config/resourceconfig.json,
and with that we were able to increase the instance count from 1 to 4. However, the processing jobs still fail with 5 or more instances. Logs from CloudWatch:
2023-09-19 17:44:26,996 - DefaultDataAnalyzer - INFO - Running command: /usr/hadoop-3.0.0/bin/hdfs dfs -get /sagemaker/end_of_job local_eoj
get: Call From algo-9/10.2.118.69 to algo-1:9820 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
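The "Connection refused" in the log means that nothing was accepting connections on algo-1:9820 at the moment algo-9 made the call. One way to narrow this down is to probe that port from each worker. The sketch below is a diagnostic aid only, not a fix. It assumes resourceconfig.json contains "current_host" and "hosts" keys, and the helper names load_hosts and wait_for_port are made up for illustration.

```python
import json
import socket
import time


def load_hosts(path="/opt/ml/input/config/resourceconfig.json"):
    """Return (current_host, hosts) from a resourceconfig.json file.

    Assumes the file has "current_host" and "hosts" keys.
    """
    with open(path) as f:
        config = json.load(f)
    return config["current_host"], config["hosts"]


def wait_for_port(host, port, attempts=5, delay_s=2.0, timeout_s=3.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            # Covers connection refused, timeouts, and name resolution failures.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False
```

Inside a container, calling load_hosts() and then wait_for_port("algo-1", 9820) on each host would show whether the port on algo-1 is unreachable from every worker or only from some of them.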
I wonder whether this is because the sagemaker-model-monitor-analyzer image
can't scale beyond 4 instances. If so, what's the best way to address this issue?