Is it possible to use smddp in a notebook?

I recently tried smddp v1.4.0 on a SageMaker notebook instance (not SageMaker Studio) running on an 8-GPU ml.p3.16xlarge, using smddp directly as the backend in the training script. I launched the estimator with instance_type set to local_gpu and got an smddp initialization error. The full output is below.

42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 636, in <module>
42u1m0wni0-algo-1-36bbw | main()
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 178, in main
42u1m0wni0-algo-1-36bbw | dist.init_process_group(backend=args.dist_backend)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
42u1m0wni0-algo-1-36bbw | store, rank, world_size = next(rendezvous_iterator)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _env_rendezvous_handler
42u1m0wni0-algo-1-36bbw | rank = int(_get_env_or_raise("RANK"))
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
42u1m0wni0-algo-1-36bbw |     raise _env_error(env_var)
42u1m0wni0-algo-1-36bbw | ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw | Environment variable SAGEMAKER_INSTANCE_TYPE is not set
42u1m0wni0-algo-1-36bbw | Running smdistributed.dataparallel v1.4.0
42u1m0wni0-algo-1-36bbw | Error in atexit._run_exitfuncs:
42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp
42u1m0wni0-algo-1-36bbw | hm.shutdown()
42u1m0wni0-algo-1-36bbw | RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    Reporting training FAILURE
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
42u1m0wni0-algo-1-36bbw | ExitCode 1
42u1m0wni0-algo-1-36bbw | ErrorMessage "ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw |  Environment variable SAGEMAKER_INSTANCE_TYPE is not set Error in atexit._run_exitfuncs: Traceback (most recent call last):   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp hm.shutdown() RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h"

My goal is to launch a single-node smddp job for debugging.

Does smddp only support jobs launched through the AWS Python SDK rather than from a notebook? Or have I done something wrong?
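
For anyone trying to reproduce this: the first error in the log comes from PyTorch's env:// rendezvous, which could not find RANK in the environment. A minimal diagnostic sketch that could be called in the training script just before dist.init_process_group is shown here. The helper name missing_rendezvous_vars is hypothetical (it is not part of smddp or PyTorch), and the four variable names are the ones the env:// init method reads. It only reports what the launcher did or did not export; it does not fix the smddp startup itself.

```python
import os

# Environment variables read by torch.distributed's env:// rendezvous.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def missing_rendezvous_vars(env=None):
    """Return the env:// rendezvous variables that are absent from env.

    Hypothetical helper for debugging; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if name not in env]


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        # Matches the situation in the traceback, where RANK was not set.
        print("Launcher did not set:", ", ".join(missing))
    else:
        print("All env:// rendezvous variables are set")
```

If the list is non-empty inside the container, the script was most likely started as a plain Python process rather than by a distributed launcher, which would be consistent with the "Was this script started with smddprun?" message in the log.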
