Multi-file source_dir bundle with SM Training Compiler (distributed)


I'm hoping to use SageMaker Training Compiler with a (Hugging Face Trainer API, PyTorch) program split across multiple .py files for maintainability. The job needs to run on multiple GPUs (although at the current scale, multi-device single-node would be acceptable).

Following the docs, I added the distributed_training_launcher.py launcher script to my source_dir bundle, and passed in the true training script via a training_script hyperparameter.
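For reference, the setup described above could be sketched roughly as follows. This shows only the settings as a plain dict, not a call into the SageMaker SDK. The source_dir folder name "src" is a hypothetical placeholder, and "train.py" is assumed from the module name 'train' in the traceback below.

```python
# Sketch of the estimator settings described above, kept as a plain dict
# so it runs without the SageMaker SDK installed.
estimator_kwargs = {
    # The launcher script from the docs is the entry point...
    "entry_point": "distributed_training_launcher.py",
    # ...and sits alongside the real training code in the bundle.
    "source_dir": "src",  # hypothetical folder name
    "hyperparameters": {
        # The true training script is passed through as a hyperparameter.
        "training_script": "train.py",  # assumed from the traceback's module name
    },
}

print(estimator_kwargs["hyperparameters"]["training_script"])
```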

...But when the job tries to start, I get:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 90, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 86, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_gpus)
AttributeError: module 'train' has no attribute '_mp_fn'

Any ideas what might be causing this? Is there some particular limitation or additional requirement for training scripts that are written over multiple files?

I also tried running in single-GPU mode (p3.2xlarge) instead, calling the training script directly rather than through the distributed launcher, and saw the error below, which seems to originate within TrainingArguments itself. I'm not sure why it's invoking a 'tensorflow/compiler' component when running in PyTorch.

EDIT: It turns out the error below can be solved by explicitly setting n_gpus, as mentioned in the troubleshooting doc, but that takes me back to the error message above.

  File "/opt/ml/code/code/config.py", line 124, in __post_init__
    super().__post_init__()
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 761, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 975, in device
    return self._setup_devices
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1754, in __get__
    cached = self.fget(obj)
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 918, in _setup_devices
    device = xm.xla_device()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
    devices = get_xla_supported_devices(
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 137, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 19, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:273 : Missing XLA configuration
AWS
Expert
Alex_T
Asked 2 years ago, viewed 267 times
1 answer
Accepted answer

Ahh, I solved this a while ago and forgot to update:

Yes, the training script needs to define a _mp_fn function (which can simply run the same code that executes under if __name__ == "__main__"), and the number of GPUs needs to be configured explicitly (at least as of the last time I checked; hopefully this will change in future).

For my particular project, the fix to enable SMTC on the existing job is available online here. For others, I'd also suggest referring to the official SMTC example notebooks and scripts!

AWS
Expert
Alex_T
Answered 2 years ago
