Questions tagged with Amazon SageMaker Model Training

Deploying a previously trained model with the SageMaker Python SDK (StatusExceptionError)

I am using a pretrained Random Forest model and trying to deploy it on Amazon SageMaker using the Python SDK:

```
import boto3
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

# SageMaker API client used below to look up the model artifact (assumed definition)
m_boto3 = boto3.client('sagemaker')

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m4.xlarge',
    framework_version='0.20.0',
    base_job_name='rf-scikit')

# trainpath and testpath are the S3 locations of the input data
sklearn_estimator.fit({'train': trainpath, 'test': testpath}, wait=False)
sklearn_estimator.latest_training_job.wait(logs='None')

artifact = m_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact persisted at ' + artifact)
```

I get the following StatusException error:

```
2022-08-25 12:03:27 Starting - Starting the training job....
2022-08-25 12:03:52 Starting - Preparing the instances for training............
2022-08-25 12:04:55 Downloading - Downloading input data......
2022-08-25 12:05:31 Training - Downloading the training image.........
2022-08-25 12:06:22 Training - Training image download completed. Training in progress..
2022-08-25 12:06:32 Uploading - Uploading generated training model.
2022-08-25 12:06:43 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-37-628f942a78d3> in <module>
----> 1 sklearn_estimator.latest_training_job.wait(logs='None')
      2 artifact = m_boto3.describe_training_job(
      3     TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']
      4
      5 print('Model artifact persisted at ' + artifact)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   2109             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2110         else:
-> 2111             self.sagemaker_session.wait_for_job(self.job_name)
   2112
   2113     def describe(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in wait_for_job(self, job, poll)
   3226             lambda last_desc: _train_done(self.sagemaker_client, job, last_desc), None, poll
   3227         )
-> 3228         self._check_job_status(job, desc, "TrainingJobStatus")
   3229         return desc
   3230

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3390                 message=message,
   3391                 allowed_statuses=["Completed", "Stopped"],
-> 3392                 actual_status=status,
   3393             )
   3394

UnexpectedStatusException: Error for Training job rf-scikit-2022-08-25-12-03-25-931: Failed. Reason: AlgorithmError: framework error:
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python script.py"
ExecuteUserScriptErr
```

The pretrained model works fine and I don't know what the problem is. Please help.
1 answer, 0 votes, 33 views, asked 3 months ago
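An `ExecuteUserScriptError` with an empty `ErrorMessage` indicates that `script.py` itself exited with a non-zero code inside the container, so the script's own traceback in the job log is usually the most informative place to look. Below is a minimal sketch of an entry point that reads the environment variables set by the SageMaker training toolkit (`SM_MODEL_DIR`, `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`) and prints a full traceback before failing. The `train` body, the fallback paths, and the `train_fn` parameter are placeholders for illustration, not the asker's code.

```python
import argparse
import os
import sys
import traceback


def parse_args(argv=None):
    """Read the locations SageMaker passes to the script via environment variables."""
    parser = argparse.ArgumentParser()
    # The fallback paths are placeholders that only matter when running locally.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "./train"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST", "./test"))
    return parser.parse_args(argv)


def train(args):
    """Placeholder: load data from args.train, fit a model, save it under args.model_dir."""
    print("train:", args.train, "test:", args.test, "model dir:", args.model_dir)


def main(argv=None, train_fn=train):
    """Run training; on failure print the traceback to stderr and return exit code 1."""
    try:
        train_fn(parse_args(argv))
    except Exception:
        traceback.print_exc(file=sys.stderr)
        return 1
    return 0

# In script.py the last line would be: sys.exit(main())
```

Running the same script locally against a small sample of the data, with the three paths passed as command-line arguments, is often the quickest way to reproduce the failure outside the training job.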

SageMaker Debugger: cannot load training information of estimator

I am using a SageMaker notebook to train an ML model. When I created and trained the estimator with the following script, I could load the debugging information (s3_output_path) as expected:

```
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_configs

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
]

collection_configs = [
    CollectionConfig(
        name="CrossEntropyLoss_output_0",
        parameters={
            "include_regex": "CrossEntropyLoss_output_0",
            "train.save_interval": "100",
            "eval.save_interval": "10",
        },
    )
]

debugger_config = DebuggerHookConfig(collection_configs=collection_configs)

# hyperparameters and inputs are defined earlier in the notebook
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # instance_type="ml.g4dn.2xlarge",
    entry_point="train.py",
    framework_version="1.8",
    py_version="py36",
    hyperparameters=hyperparameters,
    debugger_hook_config=debugger_config,
    rules=rules,
)

estimator.fit({"training": inputs})
s3_output_path = estimator.latest_job_debugger_artifacts_path()
```

After the kernel died, I attached the estimator and tried to access the debugging information of the training:

```
estimator = sagemaker.estimator.Estimator.attach('pytorch-training-2022-06-07-11-07-09-804')
s3_output_path = estimator.latest_job_debugger_artifacts_path()
rules_path = estimator.debugger_rules
```

Both values were None. Could this be a problem with the attach function? And how can I access the Debugger's training information after the kernel has been shut down?
1 answer, 0 votes, 74 views, asked 4 months ago
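Independently of what `attach` restores, the Debugger settings of a finished job remain available through the `DescribeTrainingJob` API. The sketch below rebuilds the artifact path from that response; it assumes the default layout `<S3OutputPath>/<job name>/debug-output`, and the helper names are hypothetical rather than part of the SageMaker SDK.

```python
def fetch_job_description(job_name):
    """Return the DescribeTrainingJob response (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers below work without it

    return boto3.client("sagemaker").describe_training_job(TrainingJobName=job_name)


def debugger_artifacts_path(description):
    """Rebuild the Debugger output path from a DescribeTrainingJob response.

    Assumes the default layout <S3OutputPath>/<job name>/debug-output.
    """
    hook_config = description.get("DebugHookConfig")
    if not hook_config:
        return None
    base = hook_config["S3OutputPath"].rstrip("/")
    return f"{base}/{description['TrainingJobName']}/debug-output"


def rule_statuses(description):
    """Map each Debugger rule configuration name to its evaluation status."""
    return {
        status["RuleConfigurationName"]: status.get("RuleEvaluationStatus")
        for status in description.get("DebugRuleEvaluationStatuses", [])
    }
```

In the notebook this could be used as `debugger_artifacts_path(fetch_job_description('pytorch-training-2022-06-07-11-07-09-804'))`; whether the returned prefix actually holds tensors is worth confirming with `aws s3 ls`.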

Sagemaker Training Job. Python modules installation Error

I have a problem installing a Python module that requires another module to be installed first. Both modules were added to the `requirements.txt` file. However, an error occurs when installing the main module:

```message
2022-07-29 01:18:26.460132: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2022-07-29 01:18:26.470589: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2022-07-29 01:18:26.765280: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2022-07-29 01:18:31,908 sagemaker-training-toolkit INFO Imported framework sagemaker_tensorflow_container.training
2022-07-29 01:18:31,917 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-07-29 01:18:33,117 sagemaker-training-toolkit INFO Installing dependencies from requirements.txt:
/usr/local/bin/python3.9 -m pip install -r requirements.txt
Collecting Cython==0.29.31
  Downloading Cython-0.29.31-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 33.1 MB/s eta 0:00:00
Requirement already satisfied: wheel==0.37.1 in /usr/local/lib/python3.9/site-packages (from -r requirements.txt (line 2)) (0.37.1)
Collecting scikit-image==0.19.2
  Downloading scikit_image-0.19.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 83.1 MB/s eta 0:00:00
Collecting parallelbar==0.1.19
  Downloading parallelbar-0.1.19-py3-none-any.whl (5.6 kB)
Collecting albumentations==1.0.3
  Downloading albumentations-1.0.3-py3-none-any.whl (98 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.7/98.7 kB 6.6 MB/s eta 0:00:00
Collecting tensorflow_addons==0.16.1
  Downloading tensorflow_addons-0.16.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 54.4 MB/s eta 0:00:00
Requirement already satisfied: tensorflow-io==0.24.0 in /usr/local/lib/python3.9/site-packages (from -r requirements.txt (line 7)) (0.24.0)
Requirement already satisfied: tensorboard==2.8.0 in /usr/local/lib/python3.9/site-packages (from -r requirements.txt (line 8)) (2.8.0)
Collecting universal-pathlib==0.0.12
  Downloading universal_pathlib-0.0.12-py3-none-any.whl (19 kB)
Collecting setuptools==63.2.0
  Downloading setuptools-63.2.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 58.9 MB/s eta 0:00:00
Collecting pynanosvg==0.3.1
  Downloading pynanosvg-0.3.1.tar.gz (346 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 346.0/346.0 kB 17.5 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'
  error: subprocess-exited-with-error
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-1mt2gkfy/pynanosvg_d6162ffce95948abb4262061a011908c/setup.py", line 2, in <module>
          from Cython.Build import cythonize
      ModuleNotFoundError: No module named 'Cython'
      [end of output]
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
[notice] A new release of pip available: 22.1.2 -> 22.2.1
[notice] To update, run: pip install --upgrade pip
2022-07-29 01:18:36,187 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2022-07-29 01:18:36,187 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2022-07-29 01:18:36,188 sagemaker-training-toolkit ERROR Reporting training FAILURE
2022-07-29 01:18:36,188 sagemaker-training-toolkit ERROR InstallRequirementsError:
ExitCode 1
ErrorMessage " ModuleNotFoundError: No module named 'Cython' [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. error: metadata-generation-failed × Encountered error while generating package metadata. ╰─> See above for output. note: This is an issue with the package mentioned above, not pip. hint: See above for details."
Command "/usr/local/bin/python3.9 -m pip install -r requirements.txt"
2022-07-29 01:18:36,188 sagemaker-training-toolkit ERROR Encountered exit_code 1
```
1 answer, 0 votes, 141 views, asked 4 months ago
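The log for this question suggests that Cython had only been downloaded, not installed, at the moment pip ran `setup.py` for pynanosvg, so listing Cython earlier in the requirements file does not help. One possible workaround, sketched below under the assumption that pynanosvg is removed from the requirements file and installed from the training script instead, is to run separate pip calls in order. `install_in_stages` and `pip_command` are hypothetical helper names.

```python
import subprocess
import sys


def pip_command(packages):
    """Build a pip install command that runs under the current interpreter."""
    return [sys.executable, "-m", "pip", "install", *packages]


def install_in_stages(stages, run=subprocess.check_call):
    """Install each stage with its own pip call.

    Packages from earlier stages are then importable when a later
    package executes its setup.py.
    """
    commands = [pip_command(stage) for stage in stages]
    for command in commands:
        run(command)
    return commands


# Hypothetical usage at the top of the training script:
# install_in_stages([["Cython==0.29.31"], ["pynanosvg==0.3.1"]])
```

Baking both packages into a custom training image would avoid the install step at job start altogether; which option fits better depends on how often the dependencies change.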

Can't generate XGBoost training report in SageMaker, only profiler_report

I am trying to generate the XGBoost training report to see feature importances; however, the following code only generates the profiler report.

```
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
from sagemaker.predictor import csv_serializer
from sagemaker.debugger import Rule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/models'
my_region = boto3.session.Session().region_name

# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", my_region, "latest")

bucket_name = 'binary-base'
s3 = boto3.resource('s3')
try:
    if my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)

boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('../Data/Base_Model_Data_No_Labels/train.csv')
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'validation/val.csv')).upload_file('../Data/Base_Model_Data_No_Labels/val.csv')
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('../Data/Base_Model_Data/test.csv')

sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(
    xgboost_container,
    role,
    volume_size=5,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path='s3://{}/{}/output'.format(bucket_name, prefix, 'xgboost_model'),
    sagemaker_session=sess,
    rules=rules)
xgb.set_hyperparameters(objective='binary:logistic', num_round=100, scale_pos_weight=8.5)

# s3_input_train and s3_input_val are the training and validation inputs defined elsewhere
xgb.fit({'train': s3_input_train, "validation": s3_input_val}, wait=True)
```

When checking the output path via:

```
rule_output_path = xgb.output_path + "/" + xgb.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive
```

which outputs:

```
2022-07-07 18:40:27     329715 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-report.html
2022-07-07 18:40:26     171087 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-report.ipynb
2022-07-07 18:40:23        191 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/BatchSize.json
2022-07-07 18:40:23        199 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/CPUBottleneck.json
2022-07-07 18:40:23        126 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/Dataloader.json
2022-07-07 18:40:23        127 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/GPUMemoryIncrease.json
2022-07-07 18:40:23        198 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/IOBottleneck.json
2022-07-07 18:40:23        119 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/LoadBalancing.json
2022-07-07 18:40:23        151 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/LowGPUUtilization.json
2022-07-07 18:40:23        179 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/MaxInitializationTime.json
2022-07-07 18:40:23        133 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/OverallFrameworkMetrics.json
2022-07-07 18:40:23        465 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/OverallSystemUsage.json
2022-07-07 18:40:23        156 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/StepOutlier.json
```

As you can see, only the profiler report is created, which does not interest me. Why isn't a CreateXGBoostReport folder generated with the training report? How do I generate it, and what am I missing?
0 answers, 0 votes, 58 views, asked 5 months ago
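To check quickly which rules wrote output for a job, the `aws s3 ls --recursive` listing can be reduced to the distinct folder names under `rule-output/`. The helper below is a small sketch with a hypothetical name; it only parses text and assumes the usual four-column listing format (date, time, size, key). Whether the `latest` image tag resolves to a container version that supports the report rule is a separate point worth verifying and is not confirmed by the post.

```python
def rule_output_folders(listing, marker="/rule-output/"):
    """Return the distinct folder names found directly under rule-output/."""
    folders = []
    for line in listing.splitlines():
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) < 4 or marker not in parts[3]:
            continue
        # First path component after the marker is the rule's output folder.
        folder = parts[3].split(marker, 1)[1].split("/", 1)[0]
        if folder and folder not in folders:
            folders.append(folder)
    return folders
```

For the listing in the question this yields a single `ProfilerReport-...` folder, which matches the observation that no report folder for the XGBoost rule was written.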

FSxLustre FileSystemInput in Sagemaker TrainingJob leads to: InternalServerError

We are submitting a SageMaker training job through the SageMaker SDK with a custom Docker image. The job finishes successfully for EFS [FileSystemInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.FileSystemInput) or [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput). Using the FileSystemInput with an FSxLustre configuration leads to the training job dying during the `Preparing the instances for training` stage:

```
InternalServerError: We encountered an internal error. Please try again.
```

This error is persistent upon re-submission. What we have figured out so far:

- the job errors before the training image is downloaded.
- specifying an invalid mount point leads to a proper error: ```ClientError: Unable to mount file system: xxx directory path: yyy. Incorrect mount path. Please ensure the mount path specified exists on the filesystem.```
- the job finishes successfully when running locally with docker-compose ([Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase) with `instance_type="local"`).
- we can mount the FSx file system on an EC2 instance with the TrainingJob's VPC and security group.

How can we narrow the problem down further and get more information about the failure reason? Can you suggest likely problems that could cause this behavior?
1 answer, 0 votes, 81 views, Chris, asked 6 months ago
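One way to narrow this down is to write out the channel definition exactly as the low-level `CreateTrainingJob` API expects it, so the working EFS request and the failing FSx request can be compared field by field. The sketch below builds one `InputDataConfig` entry; the function name is hypothetical, and any file system ID or path passed to it is a placeholder.

```python
def file_system_channel(channel_name, file_system_id, directory_path,
                        file_system_type="FSxLustre", access_mode="ro"):
    """Build one InputDataConfig entry with a FileSystemDataSource."""
    if file_system_type not in ("EFS", "FSxLustre"):
        raise ValueError("file_system_type must be 'EFS' or 'FSxLustre'")
    if access_mode not in ("ro", "rw"):
        raise ValueError("access_mode must be 'ro' or 'rw'")
    if not directory_path.startswith("/"):
        raise ValueError("directory_path must be an absolute path")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "DirectoryPath": directory_path,
                "FileSystemAccessMode": access_mode,
            }
        },
    }
```

Submitting both variants with identical `VpcConfig` settings and diffing the two `DescribeTrainingJob` responses may show whether anything besides the file system type differs between the runs.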

How to use SageMaker input mode FastFile with files that have Chinese characters in their names?

This post is both a bug report and a question. We are trying to use SageMaker to train a model, and the setup is quite standard. Since we have a lot of images, downloading them takes a very long time unless we change the input_mode to FastFile. With FastFile, however, I struggled to load images in the container. Many samples in my dataset have names that contain Chinese characters. When I started debugging because I could not load files properly, I found that when SageMaker mounts the data from S3, it does not handle the encoding correctly. Here is an image name and the image path inside the training container: `七年级上_第10章分式_七年级上_第10章分式_1077759_title_0-0_4_mathjax` `/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png` This is not neat, but I still have a usable path in the container. The problem is that I cannot read the file even though the path exists: `os.path.exists('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png')` returns True, but `cv2.imread('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png')` returns None. I then tried to open the file directly, which at least raises an error. The code is `with open('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png', 'rb') as f: a = f.read() ` and the error is `OSError: [Errno 107] Transport endpoint is not connected`. I also tried to load a file in the same folder whose name contains no Chinese characters. Everything works in that case, so I'm sure that the Chinese characters in the filenames are causing the problem.
I wonder if there is a quick workaround so that I don't need to rename maybe 80% of the data in S3.
0 answers, 0 votes, 50 views, asked 7 months ago
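One detail worth ruling out first: in a regular Python string literal, `\u4E03` is decoded to `七`, so the `os.path.exists` call quoted in the question may have tested the decoded name rather than the escaped spelling shown for the container path. The sketch below contrasts the two spellings; `escaped_form` is a hypothetical helper that writes each non-ASCII character as a literal `\uXXXX` sequence (four or more uppercase hex digits), so directory entries can be compared against both forms.

```python
def escaped_form(name):
    r"""Spell every non-ASCII character as a literal \uXXXX sequence."""
    return "".join(
        ch if ord(ch) < 128 else "\\u{:04X}".format(ord(ch)) for ch in name
    )


decoded = "\u4E03\u5E74\u7EA7\u4E0A"   # regular literal: Python decodes the escapes
literal = r"\u4E03\u5E74\u7EA7\u4E0A"  # raw literal: the backslashes are kept

print(decoded == "七年级上")                # the escapes were decoded to the characters
print(escaped_form("七年级上") == literal)  # helper reproduces the escaped spelling
print(len(decoded), len(literal))
```

Printing `repr()` of the entries returned by `os.listdir('/opt/ml/input/data/validation')` inside the container would show which of the two spellings the mounted file system actually exposes.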