Is it possible to use smddp in a notebook?

I recently tried smddp v1.4.0 on a SageMaker notebook instance (not SageMaker Studio) running on an 8-GPU ml.p3.16xlarge, using smddp directly as the backend in the training script. I launched the estimator with instance_type set to local_gpu and hit an smddp initialization error. The error output is below.

42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 636, in <module>
42u1m0wni0-algo-1-36bbw | main()
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 178, in main
42u1m0wni0-algo-1-36bbw | dist.init_process_group(backend=args.dist_backend)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
42u1m0wni0-algo-1-36bbw | store, rank, world_size = next(rendezvous_iterator)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _env_rendezvous_handler
42u1m0wni0-algo-1-36bbw | rank = int(_get_env_or_raise("RANK"))
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
42u1m0wni0-algo-1-36bbw |     raise _env_error(env_var)
42u1m0wni0-algo-1-36bbw | ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw | Environment variable SAGEMAKER_INSTANCE_TYPE is not set
42u1m0wni0-algo-1-36bbw | Running smdistributed.dataparallel v1.4.0
42u1m0wni0-algo-1-36bbw | Error in atexit._run_exitfuncs:
42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp
42u1m0wni0-algo-1-36bbw | hm.shutdown()
42u1m0wni0-algo-1-36bbw | RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    Reporting training FAILURE
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
42u1m0wni0-algo-1-36bbw | ExitCode 1
42u1m0wni0-algo-1-36bbw | ErrorMessage "ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw |  Environment variable SAGEMAKER_INSTANCE_TYPE is not set Error in atexit._run_exitfuncs: Traceback (most recent call last):   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp hm.shutdown() RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h"
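
The first error in the log comes from the env:// rendezvous of torch.distributed, which reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment. Below is a minimal sketch of a check that could run just before dist.init_process_group to list which of these the launcher left unset. The helper name missing_rendezvous_vars is made up for illustration and is not part of torch or smddp.

```python
import os

# Variables that torch.distributed's env:// rendezvous reads at init time.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def missing_rendezvous_vars(env=None):
    """Return the env:// rendezvous variables that are absent from env.

    Hypothetical debugging helper; pass a dict to check something other
    than the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if name not in env]


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All env:// rendezvous variables are set")
```

Run inside the container that produced the log above, this would report at least RANK as not set, which suggests the script was started as a plain Python process rather than by a distributed launcher.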

My original goal is to launch smddp on a single node for debugging.

Does smddp only support jobs launched through the AWS Python SDK, and not from a notebook? Or have I done something incorrectly?

yzs
Asked 2 years ago, viewed 108 times
No answers
