Questions tagged with Amazon SageMaker Studio Lab

Is it possible to use smddp in a notebook?

I recently tried smddp v1.4.0 on a SageMaker notebook instance (not SageMaker Studio), an 8-GPU `ml.p3.16xlarge`, using `smddp` directly as the backend in the training script. I launched the estimator with `instance_type` set to `local_gpu` and got an smddp error. The log below shows an initialization error.

```
42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 636, in <module>
42u1m0wni0-algo-1-36bbw |     main()
42u1m0wni0-algo-1-36bbw |   File "true_main_notebook.py", line 178, in main
42u1m0wni0-algo-1-36bbw |     dist.init_process_group(backend=args.dist_backend)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
42u1m0wni0-algo-1-36bbw |     store, rank, world_size = next(rendezvous_iterator)
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _env_rendezvous_handler
42u1m0wni0-algo-1-36bbw |     rank = int(_get_env_or_raise("RANK"))
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
42u1m0wni0-algo-1-36bbw |     raise _env_error(env_var)
42u1m0wni0-algo-1-36bbw | ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw | Environment variable SAGEMAKER_INSTANCE_TYPE is not set
42u1m0wni0-algo-1-36bbw | Running smdistributed.dataparallel v1.4.0
42u1m0wni0-algo-1-36bbw | Error in atexit._run_exitfuncs:
42u1m0wni0-algo-1-36bbw | Traceback (most recent call last):
42u1m0wni0-algo-1-36bbw |   File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp
42u1m0wni0-algo-1-36bbw |     hm.shutdown()
42u1m0wni0-algo-1-36bbw | RuntimeError: Was this script started with smddprun?
For more info on using smddprun, run smddprun -h
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR Reporting training FAILURE
42u1m0wni0-algo-1-36bbw | 2022-04-03 16:07:30,005 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
42u1m0wni0-algo-1-36bbw | ExitCode 1
42u1m0wni0-algo-1-36bbw | ErrorMessage "ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
42u1m0wni0-algo-1-36bbw | Environment variable SAGEMAKER_INSTANCE_TYPE is not set
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/torch/torch_smddp/__init__.py", line 51, in at_exit_smddp
    hm.shutdown()
RuntimeError: Was this script started with smddprun?
For more info on using smddprun, run smddprun -h"
```

My goal is to launch single-node smddp for debugging. Does smddp only support jobs launched through the AWS Python SDK rather than from a notebook? Or have I done something wrong?
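For anyone trying to reproduce this: the first `ValueError` comes from PyTorch's `env://` rendezvous, which reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` from the environment. A minimal sketch of a pre-flight check that could be called just before `dist.init_process_group` is below. The helper name `missing_rendezvous_vars` is hypothetical, not part of torch or smddp, and the check only reports which variables the launcher left unset. It does not tell you whether smddp supports local mode.

```python
import os

# Variables read by torch.distributed's env:// rendezvous.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def missing_rendezvous_vars(env=None):
    """Return the env:// rendezvous variables that are unset or empty.

    Hypothetical helper for debugging; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("Not set by the launcher:", ", ".join(missing))
    else:
        print("All env:// rendezvous variables are set.")
```

If `RANK` shows up as missing inside the container, the script was most likely started as a plain Python process rather than by a distributed launcher, which would be consistent with the "Was this script started with smddprun?" message in the log.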
0 answers · 0 votes · 3 views · asked 2 months ago