Questions tagged with Amazon SageMaker


Error when saving custom metrics in SageMaker Experiments through SageMaker Pipelines Training Job

I have a customer for whom I am enabling SageMaker Experiments through a training job using SageMaker Pipelines. The logic below, inserted into the training script, was working fine a few days ago, tracking custom metrics into the trial component created by SageMaker Pipelines.

```
try:
    print('>>> Loading an existing trial component')
    my_tracker = Tracker.load()
except ValueError:
    print('>>> Creating a new trial component')
    my_tracker = Tracker.create()

my_tracker.log_metric("mse:mse error", mean_squared_error(valid_y, preds))
my_tracker.close()
```

However, since yesterday the same code fails with the following error:

```
>>> Loading an existing trial component
Traceback (most recent call last):
  File "training.py", line 82, in <module>
    my_tracker = Tracker.load()
  File "/miniconda3/lib/python3.7/site-packages/smexperiments/tracker.py", line 161, in load
    _ArtifactUploader(tc.trial_component_name, artifact_bucket, artifact_prefix, boto3_session),
AttributeError: 'NoneType' object has no attribute 'trial_component_name'
```

I tried downgrading sagemaker and sagemaker-experiments to older versions but still see the same issue. The code works when I trigger the training job on its own, outside of SageMaker Pipelines, but shows the error above when it runs through SageMaker Pipelines. Any pointers on how to fix this?
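One possible workaround, offered as a sketch rather than a confirmed fix: the traceback shows `Tracker.load()` failing to resolve a trial component from the training-job environment (`tc` is `None`), so catching the resulting `AttributeError` as well and falling back to `Tracker.create()` keeps the script running; `Tracker.load()` also accepts an explicit `trial_component_name` if you can pass one into the job.

```
from smexperiments.tracker import Tracker

# Sketch only: fall back to creating a trial component when load()
# cannot resolve one from the training-job environment, as appears
# to happen here under SageMaker Pipelines.
try:
    print('>>> Loading an existing trial component')
    my_tracker = Tracker.load()  # or Tracker.load(trial_component_name=...)
except (ValueError, AttributeError):
    print('>>> Creating a new trial component')
    my_tracker = Tracker.create()
```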
1 answer · 0 votes · 34 views · asked 11 days ago

SageMaker Pipelines - Is it possible to use a TransformStep with the CatBoost Estimator?

Hi! I am trying to implement a SageMaker Pipeline including the following steps (among others):

* **ProcessingStep**: processing script (PySparkProcessor) generating train, validation, and test datasets (CSV)
* **TrainingStep**: model training with the CatBoost Estimator (https://docs.aws.amazon.com/sagemaker/latest/dg/catboost.html)
* **TransformStep**: batch inference using the model on the test dataset (CSV)

The TransformStep returns the following error:

**python3: can't open file 'serve': [Errno 2] No such file or directory**

I wonder if I'm using TransformStep in the wrong way, or if using TransformStep with the CatBoost model simply hasn't been implemented yet. Code:

```
[...]
pyspark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.1",
    role=role_arn,
    instance_type="ml.m5.xlarge",
    instance_count=12,
    sagemaker_session=pipeline_session,
    max_runtime_in_seconds=2400,
)

step_process_args = pyspark_processor.run(
    submit_app=os.path.join(s3_preprocess_script_dir, "preprocess.py"),  # Hack to fix cache hit
    submit_py_files=[
        os.path.join(s3_preprocess_script_dir, "preprocess_utils.py"),
        os.path.join(s3_preprocess_script_dir, "spark_utils.py"),
    ],
    outputs=[
        ProcessingOutput(
            output_name="datasets",
            source="/opt/ml/processing/output",
            destination=s3_preprocess_output_path,
        )
    ],
    arguments=[
        "--aws_account", AWS_ACCOUNT,
        "--aws_env", AWS_ENV,
        "--project_name", PROJECT_NAME,
        "--mode", "training",
    ],
)

step_process = ProcessingStep(
    name="PySparkPreprocessing",
    step_args=step_process_args,
    cache_config=cache_config,
)

train_model_id = "catboost-classification-model"
train_model_version = "*"
train_scope = "training"
training_instance_type = "ml.m5.xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)

# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

training_job_name = name_from_base(f"jumpstart-{train_model_id}-training")

# Create SageMaker Estimator instance.
# The default profiler rule includes a timestamp which will change each time
# the pipeline is upserted, causing cache misses. If profiling is not needed,
# set disable_profiler to True on the estimator.
tabular_estimator = Estimator(
    role=role_arn,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    max_run=360000,
    hyperparameters=hyperparameters,
    sagemaker_session=pipeline_session,
    output_path=s3_training_output_path,
    disable_profiler=True,
)

# Launch a SageMaker Training job by passing the S3 path of the training data
step_train_args = tabular_estimator.fit(
    {
        "training": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "datasets"
            ].S3Output.S3Uri
        )
    },
    logs=True,
    job_name=training_job_name,
)

step_train = TrainingStep(
    name="CatBoostTraining",
    step_args=step_train_args,
    cache_config=cache_config,
)

script_eval = ScriptProcessor(
    image_uri=[MASKED],
    command=["python3"],
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="script-evaluation",
    role=role_arn,
    sagemaker_session=pipeline_session,
)

eval_args = script_eval.run(
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs[
                "datasets"
            ].S3Output.S3Uri,
            destination="/opt/ml/processing/input",
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name="evaluation",
            source="/opt/ml/processing/evaluation",
            destination=s3_evaluation_output_path,
        ),
    ],
    code="common/evaluation.py",
)

evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)

step_eval = ProcessingStep(
    name="Evaluation",
    step_args=eval_args,
    property_files=[evaluation_report],
    cache_config=cache_config,
)

model = Model(
    image_uri="467855596088.dkr.ecr.eu-west-3.amazonaws.com/sagemaker-catboost-image:latest",
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    role=role_arn,
)

evaluation_s3_uri = "{}/evaluation.json".format(
    step_eval.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
)

model_step_args = model.create(
    instance_type="ml.m5.large",
)

create_model = ModelStep(name="CatBoostModel", step_args=model_step_args)

step_fail = FailStep(
    name="FailBranch",
    error_message=Join(on=" ", values=["Execution failed due to F1-score <", 0.8]),
)

cond_lte = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,
        json_path="classification_metrics.f1-score.value",
    ),
    right=f1_threshold,
)

step_cond = ConditionStep(
    name="F1ScoreCondition",
    conditions=[cond_lte],
    if_steps=[create_model],
    else_steps=[step_fail],
)

# Transform Job
s3_test_transform_input = os.path.join(
    step_process.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"],
    "test",
)

transformer = Transformer(
    model_name=create_model.properties.ModelName,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    assemble_with="Line",
    accept="text/csv",
    output_path=s3_test_transform_output_path,
    sagemaker_session=pipeline_session,
)

transform_step_args = transformer.transform(
    data=s3_test_transform_input,
    content_type="text/csv",
    split_type="Line",
)

step_transform = TransformStep(
    name="InferenceTransform",
    step_args=transform_step_args,
)

# Create and execute pipeline
step_transform.add_depends_on([step_process, create_model])

pipeline = Pipeline(
    name=pipeline_name,
    steps=[step_process, step_train, step_eval, step_cond, step_transform],
    sagemaker_session=pipeline_session,
)

pipeline.upsert(role_arn=role_arn, description=[MASKED])

execution = pipeline.start()
execution.wait(delay=60, max_attempts=120)
```
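One hedged guess at the cause, not a confirmed answer: `python3: can't open file 'serve'` is what a serving container prints when it lacks a serve entry point, which typically happens when a training image (or an image without serving support) is used for inference. JumpStart publishes separate inference artifacts; a sketch of retrieving them and rebuilding the `Model` used by the transform step might look like the following. The names `deploy_image_uri`, `deploy_source_uri`, and the `inference.py` entry point are assumptions based on the public JumpStart CatBoost examples and may need adjusting for your SDK version.

```
# Sketch, assuming JumpStart inference artifacts exist for the CatBoost
# model id; "inference" scope and "inference.py" follow the public
# JumpStart examples.
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope="inference",
    instance_type="ml.m5.xlarge",
)
deploy_source_uri = script_uris.retrieve(
    model_id=train_model_id,
    model_version=train_model_version,
    script_scope="inference",
)

model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    entry_point="inference.py",  # assumed name, taken from JumpStart examples
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,
    role=role_arn,
)
```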
2 answers · 0 votes · 47 views · asked 17 days ago by HaPo

Help with Inference Script for Amazon SageMaker Neo-Compiled Models

Hello everyone, I was trying to run the example from the docs: https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_neo_compilation_jobs/pytorch_torchvision/pytorch_torchvision_neo.html. I was able to run it successfully, but as soon as I changed the target_device to `jetson_tx2` and reran the entire script, keeping the rest of the code as it is, the model stopped working. I get no inferences from the deployed model; it always errors out with:

```
An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from <users-sagemaker-endpoint> with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again."
```

According to the troubleshooting docs (https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-inference.html), this appears to be an issue with the **model_fn()** function. The inference script used by the example (https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker_neo_compilation_jobs/pytorch_torchvision/code/resnet18.py) doesn't define model_fn() at all, yet it still worked for the target device `ml_c5`. So could anyone please help me with the following questions:

1. What changes does SageMaker Neo make to the model depending on the `target_device` type? It seems the same model is loaded differently for different target devices.
2. Is there a way to determine how the model is expected to be loaded for a given target_device, so that I can define **model_fn()** myself in the same inference script mentioned above?
3. Lastly, can anyone help with an inference script for this same model that also works for the `jetson_tx2` device?

Any suggestions or links on how to resolve this issue would be really helpful.
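A hedged sketch, not a verified fix: Neo-compiled models for edge targets such as Jetson are typically loaded with the DLR runtime rather than the framework's own loader, so a custom model_fn() along these lines might be a starting point. The `dlr` package and the GPU device settings are assumptions for a `jetson_tx2` target.

```
import dlr

def model_fn(model_dir):
    # Sketch: load the Neo-compiled artifact with the DLR runtime.
    # dev_type="gpu" assumes the Jetson TX2's CUDA device; use "cpu"
    # if the compiled model targets the CPU instead.
    return dlr.DLRModel(model_dir, dev_type="gpu", dev_id=0)
```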
1 answer · 0 votes · 33 views · asked 20 days ago by Rupesh