Is it possible to use smddp in a notebook?

I recently tried smddp v1.4.0 on a SageMaker notebook instance (not SageMaker Studio) with an 8-GPU ml.p3.16xlarge instance, using smddp directly as the backend in the training script. I launched the estimator with instance_type set to local_gpu and got an smddp initialization error. The full output is below.

42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 636, in <module>
42u1m0wni0-algo-1-36bbw | main()
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 178, in main
42u1m0wni0-algo-1-36bbw | dist.init_process_group(backend=args.dist_backend)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
42u1m0wni0-algo-1-36bbw | store, rank, world_size = next(rendezvous_iterator)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _env_rendezvous_handler
42u1m0wni0-algo-1-36bbw | rank = int(_get_env_or_raise("RANK"))
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
42u1m0wni0-algo-1-36bbw |     raise _env_error(env_var)
42u1m0wni0-algo-1-36bbw | ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw | Environment variable SAGEMAKER_INSTANCE_TYPE is not set
42u1m0wni0-algo-1-36bbw | Running smdistributed.dataparallel v1.4.0
42u1m0wni0-algo-1-36bbw | Error in atexit._run_exitfuncs:
42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp
42u1m0wni0-algo-1-36bbw | hm.shutdown()
42u1m0wni0-algo-1-36bbw | RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    Reporting training FAILURE
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
42u1m0wni0-algo-1-36bbw | ExitCode 1
42u1m0wni0-algo-1-36bbw | ErrorMessage "ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw |  Environment variable SAGEMAKER_INSTANCE_TYPE is not set Error in atexit._run_exitfuncs: Traceback (most recent call last):   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp hm.shutdown() RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h"

My original goal is to launch a single-node smddp job for debugging.

Does smddp only support jobs launched through the AWS Python SDK, and not from a notebook? Or have I done something incorrectly?
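
For anyone reproducing this, the traceback shows the env:// rendezvous failing because RANK is not set, which suggests the script was not started by a distributed launcher. A minimal diagnostic sketch that could run just before dist.init_process_group is below. The helper names (missing_rendezvous_vars, describe_launch) are hypothetical and not part of smddp or PyTorch; the variable list is the set that torch.distributed's env:// rendezvous reads.

```python
import os

# Environment variables read by torch.distributed's env:// rendezvous.
REQUIRED_ENV_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_rendezvous_vars(env=None):
    """Return the rendezvous variables that are absent from env.

    Hypothetical helper for debugging; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if name not in env]


def describe_launch(env=None):
    """Summarize whether the process looks like it was started by a launcher."""
    missing = missing_rendezvous_vars(env)
    if missing:
        return "rendezvous variables missing: " + ", ".join(missing)
    return "rendezvous variables present"


if __name__ == "__main__":
    # Print the status for the current process before initializing the backend.
    print(describe_launch())
```

If this reports RANK as missing inside the local_gpu container, the script is being run as a plain Python process rather than through a launcher such as smddprun, which matches the RuntimeError in the log. It only diagnoses the environment; it does not make smddp work in local mode.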

yzs
Asked 2 years ago, 108 views
No answers
