This NCCL socket timeout issue with large datasets in multi-node training is likely related to network communication timeouts during the NCCL initialization phase. Here are several approaches to resolve this:
- Increase the socket timeout settings for your NCCL communication. The default timeout values may not be sufficient when working with large datasets that require more time to initialize across nodes.
- For PyTorch Neuron training specifically, you can try reserving a specific port for communication:
  - Reserve the port (48620 here, or another available port):
    sudo sysctl -w net.ipv4.ip_local_reserved_ports=48620
  - Then set the environment variable (replace localhost with your root node's IP if needed):
    NEURON_RT_ROOT_COMM_ID="localhost:48620"
- The error might also be related to GPU memory constraints. With large datasets, you may be running out of GPU memory during initialization. Try:
  - Reducing your batch size
  - Reducing the number of active microbatches
  - Switching to manual partitioning if auto partitioning isn't balancing the workload effectively
- Check if your nodes have sufficient network bandwidth and if there are any network connectivity issues between the instances. The ml.p4de.24xlarge instances should have high network bandwidth, but there might be configuration issues.
- Add additional environment variables to tune NCCL behavior:
NCCL_SOCKET_NTHREADS=8
NCCL_NSOCKS_PERTHREAD=8
NCCL_MIN_NCHANNELS=4
- Consider implementing a progressive loading strategy where you load the dataset in chunks rather than all at once to reduce the memory pressure during initialization.
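For the port reservation step above, it can help to confirm that the port really is in the kernel's reserved list before launching. The following is a minimal sketch, assuming a Linux node with bash and sysctl available; `port_is_reserved` is a hypothetical helper name, not part of the Neuron tooling. It parses the comma-separated ports and ranges that `net.ipv4.ip_local_reserved_ports` reports.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit status 0) if port $1 falls inside the
# comma-separated list of ports/ranges in $2, e.g. "48620,50000-50010".
port_is_reserved() {
  local port="$1" list="$2" entry lo hi
  local IFS=','
  for entry in $list; do
    [ -z "$entry" ] && continue
    lo="${entry%%-*}"   # text before the dash (whole entry if no dash)
    hi="${entry##*-}"   # text after the dash (whole entry if no dash)
    if [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ]; then
      return 0
    fi
  done
  return 1
}

# Usage on a training node (assumes sysctl is on the PATH):
#   port_is_reserved 48620 "$(sysctl -n net.ipv4.ip_local_reserved_ports)" \
#     && echo "reserved" || echo "not reserved"
```

If the check reports "not reserved", the `sysctl -w` command did not take effect on that node and another process may still claim the port.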
These adjustments should help address the socket timeout issues when working with large datasets in multi-node training environments.
Sources
PyTorch Neuron (torch-neuronx) for Training Troubleshooting Guide — AWS Neuron Documentation
Model Parallel Troubleshooting - Amazon SageMaker AI
Hello,
To troubleshoot NCCL socket timeout issues, you can try the steps below:
- Resource Utilization: First, I recommend checking your training job metrics to ensure resource utilization is within healthy levels. High CPU, memory, or GPU usage could lead to delays and timeouts. You can monitor these metrics using Amazon CloudWatch [1].
- Configure longer socket timeout settings for NCCL communication to handle the extended initialization time required for large datasets across distributed nodes.
export NCCL_SOCKET_TIMEOUT=300
- Consider leveraging SageMaker's data parallel library, which is specifically designed for large-scale distributed training on SageMaker infrastructure [2].
- Network Performance: NCCL timeouts often stem from network latency or bandwidth issues. To diagnose this, you can use the iperf tool to measure network performance between nodes:
# On the server node
iperf -s
# On the client node
iperf -c <server-node-ip>
- NCCL Parameter Tuning: Adjusting NCCL parameters may help resolve timeout issues. Try increasing the NCCL timeout and enabling debug logging:
export NCCL_DEBUG=INFO
export NCCL_SOCKET_NTHREADS=8
export NCCL_NSOCKS_PERTHREAD=8
export NCCL_TIMEOUT=3600
- Implement Incremental Data Loading: Instead of loading the entire dataset at once, use a streaming approach to reduce memory pressure.
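The NCCL settings from the tuning step above can be collected in one script that is sourced before the training launch. This is a minimal sketch, assuming bash; the values are the ones suggested in this answer rather than tuned recommendations, and the `NCCL_TIMEOUT` name is taken from the answer as-is. The check on the socket settings reflects the limit in the NCCL documentation that the product of `NCCL_SOCKET_NTHREADS` and `NCCL_NSOCKS_PERTHREAD` must not exceed 64.

```shell
#!/usr/bin/env bash
# Values below come from the answer above; ${VAR:-default} keeps anything
# the job launcher has already exported.
export NCCL_DEBUG="${NCCL_DEBUG:-INFO}"
export NCCL_SOCKET_NTHREADS="${NCCL_SOCKET_NTHREADS:-8}"
export NCCL_NSOCKS_PERTHREAD="${NCCL_NSOCKS_PERTHREAD:-8}"
# Name taken from the answer as-is; confirm your NCCL/PyTorch version honors it.
export NCCL_TIMEOUT="${NCCL_TIMEOUT:-3600}"

# NCCL documents that NTHREADS * NSOCKS_PERTHREAD must not exceed 64.
socket_product=$((NCCL_SOCKET_NTHREADS * NCCL_NSOCKS_PERTHREAD))
if [ "$socket_product" -gt 64 ]; then
  echo "warning: socket thread product $socket_product exceeds 64" >&2
fi

# Print the effective settings so they appear in the training log.
env | grep '^NCCL_' | sort
```

Source the script in the same shell that launches training (for example `. ./nccl_env.sh`, a hypothetical file name) so the exports reach the worker processes.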
References:
[1] https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html
[2] https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest/smd_data_parallel_pytorch.html
