
Questions tagged with Amazon SageMaker Model Training


No such file or directory: '/opt/ml/input/data/test/revenue_train.csv' Sagemaker [SM_CHANNEL_TRAIN]

I am trying to deploy my RandomForestClassifier on Amazon SageMaker using the Python SDK. I have been following this example https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-script-mode/sagemaker-script-mode.ipynb but keep getting an error that the train file was not found. I think the files were not uploaded to the correct channel. When I run the script as follows, it works fine:

```
! python script_rf.py --model-dir ./ \
    --train ./ \
    --test ./
```

This is my script code:

```
# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == '__main__':
    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--max_depth', type=int, default=2)
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--random_state', type=int, default=0)

    # Data, model, and output directories
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='revenue_train.csv')
    parser.add_argument('--test-file', type=str, default='revenue_test.csv')
    args, _ = parser.parse_known_args()

    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    if len(train_df) == 0:
        raise ValueError(('There are no files in {}.\n').format(args.train, "train"))

    print('building training and testing datasets')
    attributes = ['available_minutes_100', 'ampido_slots_amount', 'ampido_slots_amount_100',
                  'ampido_slots_amount_200', 'ampido_slots_amount_300', 'min_dist_loc',
                  'count_event', 'min_dist_phouses', 'count_phouses', 'min_dist_stops',
                  'count_stops', 'min_dist_tickets', 'count_tickets', 'min_dist_google',
                  'min_dist_psa', 'count_psa']
    X_train = train_df[attributes]
    X_test = test_df[attributes]
    y_train = train_df['target']
    y_test = test_df['target']

    # train
    print('training model')
    model = RandomForestClassifier(
        max_depth=args.max_depth,
        n_estimators=args.n_estimators)
    model.fit(X_train, y_train)

    # persist model
    path = os.path.join(args.model_dir, "model_rf.joblib")
    joblib.dump(model, path)
    print('model persisted at ' + path)

    # print accuracy and confusion matrix
    print('validating model')
    y_pred = model.predict(X_test)
    print('Confusion Matrix:')
    result = confusion_matrix(y_test, y_pred)
    print(result)
    print('Accuracy:')
    result2 = accuracy_score(y_test, y_pred)
    print(result2)
```

The error is raised at the `train_df` line of the script (`FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/test/revenue_train.csv'`).

I tried specifying the input parameters:

```
# change channel input dirs
inputs = {
    "train": "ampido-exports/production/revenue_train",
    "test": "ampido-exports/production/revenue_test",
}

from sagemaker.sklearn.estimator import SKLearn

enable_local_mode_training = False
hyperparameters = {"max_depth": 2, 'random_state': 0, "n_estimators": 100}

if enable_local_mode_training:
    train_instance_type = "local"
    inputs = {"train": trainpath, "test": testpath}
else:
    train_instance_type = "ml.c5.xlarge"
    inputs = {"train": trainpath, "test": testpath}

estimator_parameters = {
    "entry_point": "script_rf.py",
    "framework_version": "1.0-1",
    "py_version": "py3",
    "instance_type": train_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "randomforestclassifier-model",
    'channel_input_dirs': inputs
}
estimator = SKLearn(**estimator_parameters)
estimator.fit(inputs)
```

but I still get the error `FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/test/revenue_train.csv'`.
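A side note on the path in the traceback: in script mode, each key of the dict passed to `fit()` becomes an input channel, mounted at `/opt/ml/input/data/<channel>` and advertised to the script as an `SM_CHANNEL_<NAME>` environment variable. A minimal sketch of that convention (plain Python, no SageMaker calls; the function name is made up for illustration):

```python
# Sketch of SageMaker script mode's channel convention: each key of the
# inputs dict passed to estimator.fit() becomes a channel mounted at
# /opt/ml/input/data/<channel> and exposed as SM_CHANNEL_<NAME>.
def channel_env_vars(inputs):
    base = "/opt/ml/input/data"
    return {f"SM_CHANNEL_{name.upper()}": f"{base}/{name}" for name in inputs}

env = channel_env_vars({"train": "s3://bucket/train", "test": "s3://bucket/test"})
# A --train argument whose default is SM_CHANNEL_TEST would therefore read
# from /opt/ml/input/data/test, which matches the path in the error above.
```

This is why the file a script loads is determined by which env var its argument defaults to, not by the file's name.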
1 answer · 0 votes · 29 views · asked a month ago

deploying previously trained model with Sagemaker Python SDK (StatusExceptionError)

I am using a pretrained Random Forest model and trying to deploy it on Amazon SageMaker using the Python SDK:

```
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m4.xlarge',
    framework_version='0.20.0',
    base_job_name='rf-scikit')

sklearn_estimator.fit({'train': trainpath, 'test': testpath}, wait=False)
sklearn_estimator.latest_training_job.wait(logs='None')
artifact = m_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']
print('Model artifact persisted at ' + artifact)
```

I get the following UnexpectedStatusException:

```
2022-08-25 12:03:27 Starting - Starting the training job....
2022-08-25 12:03:52 Starting - Preparing the instances for training............
2022-08-25 12:04:55 Downloading - Downloading input data......
2022-08-25 12:05:31 Training - Downloading the training image.........
2022-08-25 12:06:22 Training - Training image download completed. Training in progress..
2022-08-25 12:06:32 Uploading - Uploading generated training model.
2022-08-25 12:06:43 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-37-628f942a78d3> in <module>
----> 1 sklearn_estimator.latest_training_job.wait(logs='None')
      2 artifact = m_boto3.describe_training_job(
      3     TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']
      4
      5 print('Model artifact persisted at ' + artifact)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   2109             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2110         else:
-> 2111             self.sagemaker_session.wait_for_job(self.job_name)
   2112
   2113     def describe(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in wait_for_job(self, job, poll)
   3226             lambda last_desc: _train_done(self.sagemaker_client, job, last_desc), None, poll
   3227         )
-> 3228         self._check_job_status(job, desc, "TrainingJobStatus")
   3229         return desc
   3230

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3390                 message=message,
   3391                 allowed_statuses=["Completed", "Stopped"],
-> 3392                 actual_status=status,
   3393             )
   3394

UnexpectedStatusException: Error for Training job rf-scikit-2022-08-25-12-03-25-931: Failed.
Reason: AlgorithmError: framework error:
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python script.py"
```

The pretrained model works fine and I don't know what the problem is; please help.
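When `ErrorMessage` comes back empty like this, the script usually failed before printing anything useful to the job log. One common technique, sketched here under the assumption that the entry point has a plain `main()` function, is to wrap the entry point so the full traceback reaches stdout (and therefore CloudWatch) before the process exits non-zero:

```python
# Hedged sketch: surface the real traceback in the training job's
# CloudWatch logs instead of an empty ErrorMessage.
import sys
import traceback

def run_guarded(main):
    """Run main(), printing the full traceback before exiting with code 1."""
    try:
        main()
    except Exception:
        traceback.print_exc(file=sys.stdout)  # appears in CloudWatch
        sys.exit(1)
```

Running the script directly outside the container (`python script.py` with the same arguments) is often the fastest way to reproduce and read the underlying error.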
1 answer · 0 votes · 19 views · asked a month ago

SageMaker Debugger: cannot load training information of estimator

I am using a SageMaker notebook for training an ML model. When I created and trained the estimator successfully with the following script, I could load the debugging information (`s3_output_path`) as expected:

```
from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_configs

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization())]

collection_configs = [CollectionConfig(
    name="CrossEntropyLoss_output_0",
    parameters={
        "include_regex": "CrossEntropyLoss_output_0",
        "train.save_interval": "100",
        "eval.save_interval": "10"})]

debugger_config = DebuggerHookConfig(collection_configs=collection_configs)

estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    #instance_type="ml.g4dn.2xlarge",
    entry_point="train.py",
    framework_version="1.8",
    py_version="py36",
    hyperparameters=hyperparameters,
    debugger_hook_config=debugger_config,
    rules=rules,
)
estimator.fit({"training": inputs})
s3_output_path = estimator.latest_job_debugger_artifacts_path()
```

After the kernel died, I attached the estimator and tried to access the debugging information of the training:

```
estimator = sagemaker.estimator.Estimator.attach('pytorch-training-2022-06-07-11-07-09-804')
s3_output_path = estimator.latest_job_debugger_artifacts_path()
rules_path = estimator.debugger_rules
```

The return values of these two functions were `None`. Could this be a problem with the attach function? And how can I access the debugger's training information after the kernel was shut down?
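If the attached estimator returns `None`, one workaround is to reconstruct the S3 prefix yourself, assuming the default `DebuggerHookConfig` layout in which tensors are written under `<output_path>/<job_name>/debug-output`. The helper below is a hypothetical sketch of that path construction, not a SageMaker API:

```python
def default_debug_artifacts_path(output_path: str, job_name: str) -> str:
    """S3 prefix where the debugger hook writes tensors by default
    (assumes the default DebuggerHookConfig layout)."""
    return f"{output_path.rstrip('/')}/{job_name}/debug-output"

path = default_debug_artifacts_path(
    "s3://my-bucket/output", "pytorch-training-2022-06-07-11-07-09-804")
```

The actual output path and hook configuration for a finished job can be read back from the training job description in the console, so the assumption above is easy to verify against a real job.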
1 answer · 0 votes · 58 views · asked a month ago

Sagemaker Training Job: Python module installation error

I have a problem installing a Python module that requires another module to be installed first. Both modules were added to the `requirements.txt` file. However, the error occurs when installing the main module:

```
2022-07-29 01:18:26.460132: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2022-07-29 01:18:26.470589: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2022-07-29 01:18:26.765280: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2022-07-29 01:18:31,908 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2022-07-29 01:18:31,917 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-07-29 01:18:33,117 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/usr/local/bin/python3.9 -m pip install -r requirements.txt
Collecting Cython==0.29.31
  Downloading Cython-0.29.31-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (2.0 MB)
Requirement already satisfied: wheel==0.37.1 in /usr/local/lib/python3.9/site-packages (from -r requirements.txt (line 2)) (0.37.1)
Collecting scikit-image==0.19.2
  Downloading scikit_image-0.19.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB)
Collecting parallelbar==0.1.19
  Downloading parallelbar-0.1.19-py3-none-any.whl (5.6 kB)
Collecting albumentations==1.0.3
  Downloading albumentations-1.0.3-py3-none-any.whl (98 kB)
Collecting tensorflow_addons==0.16.1
  Downloading tensorflow_addons-0.16.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Requirement already satisfied: tensorflow-io==0.24.0 in /usr/local/lib/python3.9/site-packages (from -r requirements.txt (line 7)) (0.24.0)
Requirement already satisfied: tensorboard==2.8.0 in /usr/local/lib/python3.9/site-packages (from -r requirements.txt (line 8)) (2.8.0)
Collecting universal-pathlib==0.0.12
  Downloading universal_pathlib-0.0.12-py3-none-any.whl (19 kB)
Collecting setuptools==63.2.0
  Downloading setuptools-63.2.0-py3-none-any.whl (1.2 MB)
Collecting pynanosvg==0.3.1
  Downloading pynanosvg-0.3.1.tar.gz (346 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-1mt2gkfy/pynanosvg_d6162ffce95948abb4262061a011908c/setup.py", line 2, in <module>
          from Cython.Build import cythonize
      ModuleNotFoundError: No module named 'Cython'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

[notice] A new release of pip available: 22.1.2 -> 22.2.1
[notice] To update, run: pip install --upgrade pip
2022-07-29 01:18:36,187 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2022-07-29 01:18:36,187 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.
2022-07-29 01:18:36,188 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2022-07-29 01:18:36,188 sagemaker-training-toolkit ERROR    InstallRequirementsError:
ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'Cython'"
Command "/usr/local/bin/python3.9 -m pip install -r requirements.txt"
2022-07-29 01:18:36,188 sagemaker-training-toolkit ERROR    Encountered exit_code 1
```
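The pattern behind this failure is that `pynanosvg`'s `setup.py` imports Cython at build time, but a single `pip install -r requirements.txt` does not guarantee Cython is importable when the sdist's metadata is generated, regardless of the order of lines in the file. A common workaround is to install build-time dependencies in a separate, earlier pip invocation. The helper below is a hypothetical sketch of how such a split could look (the set of build-time packages is an assumption taken from the log above):

```python
# Hypothetical sketch: split a requirements list so build-time dependencies
# (e.g. Cython) can be installed in a separate pip run before the rest.
BUILD_TIME_DEPS = {"cython", "setuptools", "wheel"}

def split_requirements(lines):
    """Partition pinned requirement lines into (build-time deps, the rest)."""
    build, rest = [], []
    for line in lines:
        name = line.split("==")[0].strip().lower()
        (build if name in BUILD_TIME_DEPS else rest).append(line)
    return build, rest

build, rest = split_requirements(["Cython==0.29.31", "pynanosvg==0.3.1"])
# pip install the `build` list first, then the `rest` list.
```

In a SageMaker container, the first pip run could be done from the entry-point script itself before the heavy imports, since the toolkit's own requirements install happens in one pass.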
1 answer · 0 votes · 57 views · asked 2 months ago

Can't generate XGBoost training report in SageMaker, only profiler_report

I am trying to generate the XGBoost training report to see feature importances; however, the following code only generates the profiler report:

```
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
from sagemaker.predictor import csv_serializer
from sagemaker.debugger import Rule, rule_configs

# Define IAM role
rules = [
    Rule.sagemaker(rule_configs.create_xgboost_report())
]
role = get_execution_role()
prefix = 'sagemaker/models'
my_region = boto3.session.Session().region_name

# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", my_region, "latest")

bucket_name = 'binary-base'
s3 = boto3.resource('s3')
try:
    if my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)

boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('../Data/Base_Model_Data_No_Labels/train.csv')
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'validation/val.csv')).upload_file('../Data/Base_Model_Data_No_Labels/val.csv')
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('../Data/Base_Model_Data/test.csv')

sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(xgboost_container,
                                    role,
                                    volume_size=5,
                                    instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket_name, prefix, 'xgboost_model'),
                                    sagemaker_session=sess,
                                    rules=rules)
xgb.set_hyperparameters(objective='binary:logistic',
                        num_round=100,
                        scale_pos_weight=8.5)
xgb.fit({'train': s3_input_train, "validation": s3_input_val}, wait=True)
```

When checking the output path via:

```
rule_output_path = xgb.output_path + "/" + xgb.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive
```

it outputs:

```
2022-07-07 18:40:27     329715 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-report.html
2022-07-07 18:40:26     171087 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-report.ipynb
2022-07-07 18:40:23        191 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/BatchSize.json
2022-07-07 18:40:23        199 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/CPUBottleneck.json
2022-07-07 18:40:23        126 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/Dataloader.json
2022-07-07 18:40:23        127 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/GPUMemoryIncrease.json
2022-07-07 18:40:23        198 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/IOBottleneck.json
2022-07-07 18:40:23        119 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/LoadBalancing.json
2022-07-07 18:40:23        151 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/LowGPUUtilization.json
2022-07-07 18:40:23        179 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/MaxInitializationTime.json
2022-07-07 18:40:23        133 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/OverallFrameworkMetrics.json
2022-07-07 18:40:23        465 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/OverallSystemUsage.json
2022-07-07 18:40:23        156 sagemaker/models/output/xgboost-2022-07-07-18-35-55-436/rule-output/ProfilerReport-1657218955/profiler-output/profiler-reports/StepOutlier.json
```

As you can see, only the profiler report is created, which does not interest me. Why isn't there a CreateXGBoostReport folder generated with the training report? How do I generate this / what am I missing?
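Each Debugger rule writes its artifacts under its own folder inside `rule-output`, and the XGBoost report rule's folder can appear several minutes after the job itself finishes, so a listing taken right after `fit()` returns may genuinely not contain it yet. A small, hypothetical helper for checking a key listing for a given rule's output (the key strings below are made-up examples shaped like the listing above):

```python
def rule_output_keys(keys, rule_prefix):
    """Return the keys that belong to a given rule's folder under rule-output."""
    return [k for k in keys if f"/rule-output/{rule_prefix}" in k]

# Hypothetical listing: one profiler key like the question's output, and one
# key shaped the way a report rule's folder might be named.
listing = [
    "sagemaker/models/output/job/rule-output/ProfilerReport-1657218955/profiler-output/profiler-report.html",
    "sagemaker/models/output/job/rule-output/CreateXgboostReport/xgboost_report.html",
]
found = rule_output_keys(listing, "CreateXgboostReport")
```

Polling the listing with such a filter, rather than checking once, distinguishes "the rule has not finished yet" from "the rule never ran".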
0 answers · 0 votes · 42 views · asked 3 months ago

FSxLustre FileSystemInput in Sagemaker TrainingJob leads to: InternalServerError

We are submitting a SageMaker training job with the SageMaker SDK and a custom Docker image. The job finishes successfully for EFS [FileSystemInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.FileSystemInput) or [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput). Trying to use the FileSystemInput with an FSxLustre configuration leads to the training job dying during the `Preparing the instances for training` stage:

```
InternalServerError: We encountered an internal error. Please try again.
```

This error is persistent upon re-submission. What we have figured out so far:

- The job errors before the training image is downloaded.
- Specifying an invalid mount point leads to a proper error: `ClientError: Unable to mount file system: xxx directory path: yyy. Incorrect mount path. Please ensure the mount path specified exists on the filesystem.`
- The job finishes successfully when running locally with docker-compose ([Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase) with `instance_type="local"`).
- We can mount the FSx file system on an EC2 instance with the training job's VPC and security group.

How can we narrow the problem down further and get more information about the failure reason? Can you suggest likely problems that could cause this behavior?
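One concrete thing worth ruling out: FSx for Lustre clients communicate over TCP port 988 (plus an additional port range given in the FSx documentation), and a security group that happens to work for an EC2 mount can still be too narrow for the training instances if, for example, it relies on a rule the training job's ENIs do not match. A hypothetical sanity check over a simplified rule list (the rule-dict shape here is made up for illustration, not the EC2 API's):

```python
LUSTRE_PORT = 988  # main FSx for Lustre traffic port; see the FSx docs for the full range

def allows_lustre_traffic(rules):
    """Check whether any rule in a simplified security-group rule list
    covers TCP port 988 ('-1' meaning all protocols)."""
    return any(
        r["protocol"] in ("tcp", "-1")
        and r["from_port"] <= LUSTRE_PORT <= r["to_port"]
        for r in rules
    )

ok = allows_lustre_traffic([{"protocol": "tcp", "from_port": 988, "to_port": 988}])
```

Comparing the describe-training-job `FailureReason` field against such a checklist (ports, subnets, file-system lifecycle state) is usually the fastest way to narrow an `InternalServerError` down before opening a support case.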
1 answer · 0 votes · 55 views · asked 3 months ago