Is it possible to use smddp in a notebook?


I recently tried smddp v1.4.0 on a SageMaker notebook instance (not SageMaker Studio) running on an 8-GPU ml.p3.16xlarge, using smddp directly as the backend in the training script. I launched the estimator with instance_type set to local_gpu, and training failed with an smddp initialization error. The log is below.

42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 636, in <module>
42u1m0wni0-algo-1-36bbw | main()
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 178, in main
42u1m0wni0-algo-1-36bbw | dist.init_process_group(backend=args.dist_backend)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
42u1m0wni0-algo-1-36bbw | store, rank, world_size = next(rendezvous_iterator)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _env_rendezvous_handler
42u1m0wni0-algo-1-36bbw | rank = int(_get_env_or_raise("RANK"))
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
42u1m0wni0-algo-1-36bbw |     raise _env_error(env_var)
42u1m0wni0-algo-1-36bbw | ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw | Environment variable SAGEMAKER_INSTANCE_TYPE is not set
42u1m0wni0-algo-1-36bbw | Running smdistributed.dataparallel v1.4.0
42u1m0wni0-algo-1-36bbw | Error in atexit._run_exitfuncs:
42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp
42u1m0wni0-algo-1-36bbw | hm.shutdown()
42u1m0wni0-algo-1-36bbw | RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    Reporting training FAILURE
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
42u1m0wni0-algo-1-36bbw | ExitCode 1
42u1m0wni0-algo-1-36bbw | ErrorMessage "ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw |  Environment variable SAGEMAKER_INSTANCE_TYPE is not set Error in atexit._run_exitfuncs: Traceback (most recent call last):   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp hm.shutdown() RuntimeError: Was this script started with smddprun? For more info on using smddprun, run smddprun -h"
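
The first failure in the log comes from the env:// rendezvous of torch.distributed, which by default reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment. As a debugging aid, a check like the one below could be run just before dist.init_process_group to see which of these the local container is missing. This is only a minimal sketch: the helper name missing_rendezvous_vars is hypothetical and is not part of smddp or PyTorch.

```python
import os

# Variables that torch.distributed's env:// rendezvous reads by default.
ENV_RENDEZVOUS_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_rendezvous_vars(env=None):
    """Return the env:// rendezvous variables that are absent from env.

    Hypothetical helper for debugging; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in ENV_RENDEZVOUS_VARS if name not in env]


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All env:// rendezvous variables are set")
```

If RANK shows up as missing, the script was most likely started as a plain Python process rather than through a launcher that sets these variables, which would be consistent with the "Was this script started with smddprun?" message in the log.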

My original goal is to launch smddp on a single node for debugging.

Does smddp only support jobs launched through the AWS Python SDK, and not from a notebook? Or is something in my setup incorrect?
