Questions tagged with Amazon SageMaker Model Training

Amazon SageMaker reduces the time and cost to train and tune machine learning (ML) models without the need to manage infrastructure. With SageMaker, easily train and tune ML models using built-in tools to manage and track training experiments, automatically choose optimal hyperparameters, debug training jobs, and monitor the utilization of system resources such as GPUs, CPUs, and network bandwidth.

76 results
I am using Amazon SageMaker AI Model Monitor and trying to run an ON-DEMAND (data-quality) monitoring job using DefaultModelMonitor. Code: from sagemaker.model_monitor import DefaultModelMonitor, En...
2 answers, 0 votes, 38 views, asked 2 months ago
I'm running multi-node pretraining with LLaMA-Factory using ml.p4de.24xlarge on SageMaker. The job fails with this error: [rankX]: [c10d] While waitForInput, poolFD failed... torch.distributed.DistBa...
2 answers, 0 votes, 727 views, asked 8 months ago
In CloudWatch, SageMaker training jobs are found in the log group: `/aws/sagemaker/TrainingJobs`. The log stream name has the format: "{sagemaker_training_job_name}/algo-1-..." How can I programmatic...
1 answer, 0 votes, 249 views, AWS, asked a year ago
I would like to save the logs from a SageMaker training job, following something similar to the code snippet below. ```python estimator = JumpStartEstimator( model_id = "...", environment = {...
1 answer, 0 votes, 148 views, AWS, asked a year ago
Hi everyone. When I run the AutoGluon algorithm, the following error appears after trying to read the entry_point: ``` UnexpectedStatusException: Error for Training job builtIn-example-autogluon-...
0 answers, 0 votes, 117 views, asked a year ago
I am trying to train a SageMaker built-in KMeans model on data stored in RecordIO-Protobuf format, using the Pipe input mode. However, the training job fails with the following error: ``` UnexpectedSt...
1 answer, 0 votes, 116 views, asked a year ago
Hi everybody. I'm stuck when calling the describe_auto_ml_job_v2 method. I can't find the best candidate because of a KeyError. It seems that when I print the method's output, the following keys fail after sm.describe...
2 answers, 0 votes, 133 views, asked a year ago
Hi, I am using a SageMaker training job, and it fails when it tries to upload the model artifact to a bucket that has Object Lock enabled. It throws this error: ClientError: Artifact upload failed:Error 7:...
1 answer, 0 votes, 313 views, AWS, asked a year ago
**I followed the instructions in: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-walkthrough-3rdgit.html** ![Enter image description here](/media/postImages/original/IMD2dEnWaoRY...
0 answers, 0 votes, 117 views, asked 2 years ago
I want to create a training job on SageMaker and associate both performance metrics and a model artifact with it. However, I have two problems with this: * In the SageMaker "experiments" section, I se...
1 answer, 0 votes, 357 views, asked 2 years ago
Hello, I have started running a command to train a model using Ultralytics YOLOv8.2.4. Most of the prerequisites should have already been installed. However, whenever I run the cell, it will get stuck ...
1 answer, 0 votes, 713 views, asked 2 years ago
Hi team, I am currently working on developing an AWS application aimed at checking the compliance of identity photos with our organization's rules. This application will be utilized for various purpo...
1 answer, 1 vote, 715 views, asked 2 years ago