I am using a 4-GPU g5.12xlarge instance to train object detection models, and each iteration takes about 1 second. This week, when I started new training runs on 3 different instances, training is taking longer: about 2.5 seconds per iteration. I didn't make any changes to the model.

1 Answer

The situation you're describing, where your object detection model training has suddenly slowed down without any changes to the model itself, is unusual and could be caused by several factors.

First, the g5.12xlarge instance you're using is well suited to object detection training: it belongs to the GPU instance families recommended for such workloads and offers good deep learning performance.

However, the significant increase in iteration time from 1 second to 2.5 seconds across multiple instances suggests a systemic issue rather than an isolated incident. Here are a few possibilities to consider:

  1. Instance availability and performance: There might be underlying hardware or capacity issues affecting the G5 instances in your region. AWS occasionally faces high demand for specific instance types, which can impact performance.

  2. Network or storage bottlenecks: If your training data is being fetched from S3 or another storage service, network congestion or storage performance issues could be slowing down data loading (the timing sketch after this list can help separate data-loading time from GPU compute time).

  3. Background processes: Check if there are any new background processes or updates running on the instances that could be consuming resources.

  4. Changes in the AWS environment: Recent updates or changes to the AWS infrastructure might have inadvertently affected performance.

  5. Dataset changes: Even if you haven't changed the model, any changes to the training data or how it's being loaded could impact training speed.
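A quick way to tell a data-loading problem (points 2 and 5) apart from a GPU-side problem is to time how long each iteration spends waiting for the next batch versus running the forward/backward pass. The sketch below assumes a PyTorch-style training loop; `model`, `train_loader`, `criterion`, and `optimizer` are placeholders for your own objects, and the loss computation is schematic rather than specific to any particular detection framework.

```python
import time

import torch

def profile_iterations(model, train_loader, criterion, optimizer, device, max_iters=50):
    """Rough per-iteration breakdown: time spent waiting on data vs. GPU compute."""
    model.train()
    data_time = compute_time = 0.0
    batches = iter(train_loader)

    for _ in range(max_iters):
        t0 = time.perf_counter()
        images, targets = next(batches)          # time spent waiting on the DataLoader / storage
        images = images.to(device, non_blocking=True)
        t1 = time.perf_counter()

        optimizer.zero_grad()
        outputs = model(images)                  # schematic; adapt to your detection model's API
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize(device)           # make GPU work show up in the wall-clock time
        t2 = time.perf_counter()

        data_time += t1 - t0
        compute_time += t2 - t1

    print(f"avg data wait:   {data_time / max_iters:.3f} s")
    print(f"avg GPU compute: {compute_time / max_iters:.3f} s")
```

If the extra 1.5 seconds shows up in the data wait, look at the DataLoader workers, the S3/EBS path, and whether the dataset itself changed; if it shows up in the compute portion, focus on the GPUs and the instance itself.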

To troubleshoot this issue:

  1. Monitor GPU utilization and memory usage during training to confirm the GPUs are being fully utilized (see the utilization sketch after this list).

  2. Try training on a different instance type, such as P3 or G4dn, to see if the problem persists.

  3. Check CloudWatch metrics for your instances to identify any unusual patterns in CPU, network, or disk usage (a boto3 sketch for pulling these metrics also follows this list).

  4. Verify that your training data is being accessed efficiently and there are no bottlenecks in data loading.

  5. Consider reaching out to AWS support for a more in-depth investigation of the instance performance.
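For step 1, watching `nvidia-smi` in a second terminal during training is the quickest check. If you'd rather log utilization from Python, here is a minimal sketch using the NVML bindings (`pip install nvidia-ml-py`); the function name and sampling interval are just illustrative.

```python
import time

import pynvml

def log_gpu_utilization(interval_s=5, samples=12):
    """Print GPU and memory utilization for every GPU on the instance."""
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
        for _ in range(samples):
            for i, h in enumerate(handles):
                util = pynvml.nvmlDeviceGetUtilizationRates(h)   # percent busy
                mem = pynvml.nvmlDeviceGetMemoryInfo(h)          # bytes
                print(f"GPU {i}: util {util.gpu:3d}%  "
                      f"mem {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    log_gpu_utilization()
```

If utilization sits well below 100% on all four GPUs during the slow runs, the bottleneck is more likely data loading or CPU-side preprocessing than the GPUs themselves.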
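For step 3, the same CloudWatch metrics can also be pulled programmatically and compared between a fast run and a slow run. A minimal boto3 sketch, using the default EC2 metrics; the instance ID and region below are placeholders.

```python
from datetime import datetime, timedelta, timezone

import boto3

# Placeholders: replace with your own instance ID and region.
INSTANCE_ID = "i-0123456789abcdef0"
REGION = "us-east-1"

cloudwatch = boto3.client("cloudwatch", region_name=REGION)
end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

for metric in ("CPUUtilization", "NetworkIn", "NetworkOut"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,            # 5-minute buckets
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    print(f"--- {metric} ---")
    for p in points:
        print(f"{p['Timestamp']:%Y-%m-%d %H:%M}  {p['Average']:.2f}")
```

Note that the default EC2 metrics are 5-minute averages and don't include memory or disk usage; enabling detailed monitoring or installing the CloudWatch agent gives finer-grained data.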

If the problem continues, it may also be worth exploring distributed training across multiple instances; spreading the work out won't necessarily restore the original per-iteration time, but it can bring the overall training time back down.
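If you do go down that route, one common pattern with PyTorch is DistributedDataParallel launched via torchrun. A minimal sketch, assuming a torchrun launch (which sets LOCAL_RANK); `setup_ddp` is an illustrative helper, not part of any particular framework.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model):
    """Wrap an existing model for multi-GPU / multi-node data-parallel training."""
    dist.init_process_group(backend="nccl")       # NCCL for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun for each process
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank]), local_rank
```

Each process drives one GPU, and the DataLoader should use a DistributedSampler so every process sees a different shard of the dataset.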

answered 19 days ago
