Hi, I am using PyTorch DDP on SageMaker. It uses MPI and runs 4 separate processes, one per GPU, on the 4 GPUs of a g4dn.12xlarge. In PyTorch, this is called ddp_spawn, right? Is there a way to force DDP instead of ddp_spawn on SageMaker?
My `distribution` argument is:

```python
distribution = {
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
```
SageMaker does use DDP: essentially, it launches a separate process per GPU, and each process initializes DDP and runs the training loop.
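Since each MPI-launched process has to discover its own rank before it can call `torch.distributed.init_process_group`, here is a minimal sketch of that discovery step. The precedence (torch-style `RANK`/`WORLD_SIZE`/`LOCAL_RANK` first, then the Open MPI `OMPI_COMM_WORLD_*` variables) is an assumption about what the launcher exports, not SageMaker's documented contract:

```python
import os

def resolve_ddp_env(gpus_per_node=4):
    """Resolve (rank, world_size, local_rank) for one DDP worker process.

    Checks torch.distributed-style variables first, then falls back to the
    variables Open MPI's mpirun exports. The 4-GPU default matches a
    g4dn.12xlarge node; adjust for other instance types.
    """
    rank = int(os.environ.get("RANK",
                              os.environ.get("OMPI_COMM_WORLD_RANK", "0")))
    world_size = int(os.environ.get("WORLD_SIZE",
                                    os.environ.get("OMPI_COMM_WORLD_SIZE", "1")))
    local_rank = int(os.environ.get("LOCAL_RANK",
                                    os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK",
                                                   rank % gpus_per_node)))
    return rank, world_size, local_rank

# In the training script, each of the 4 processes would then do roughly:
#   torch.cuda.set_device(local_rank)
#   torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)
```

This is the classic one-process-per-GPU DDP pattern, which is distinct from ddp_spawn (where a single parent process forks the workers itself).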