I'm hoping to use SageMaker Training Compiler with a PyTorch program (built on the Hugging Face Trainer API) that's split across multiple .py files for maintainability. The job needs to run on multiple GPUs (although at the current scale, single-node multi-device would be acceptable).
Following the docs, I added the distributed_training_launcher.py launcher script to my source_dir bundle, and passed the actual training script in via a training_script hyperparameter.
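For context, the estimator setup looks roughly like this (a sketch only - the instance type/count, framework versions, source_dir layout, and the train.py name are illustrative stand-ins for my actual configuration):

import sagemaker
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# Illustrative setup: versions, paths, and instance details are placeholders.
estimator = HuggingFace(
    entry_point="distributed_training_launcher.py",  # launcher script from the docs
    source_dir="./src",  # bundle containing train.py plus its helper modules
    role=sagemaker.get_execution_role(),
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    transformers_version="4.11",
    pytorch_version="1.9",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),  # enable Training Compiler
    hyperparameters={
        "training_script": "train.py",  # the real training script to launch
        # ...plus whatever arguments train.py itself expects
    },
)
estimator.fit()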
...But when the job tries to start, I get:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 90, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 86, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_gpus)
AttributeError: module 'train' has no attribute '_mp_fn'
Any ideas what might be causing this? Is there some particular limitation or additional requirement for training scripts that are written over multiple files?
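From the traceback, torch_xla's xla_spawn imports the training script as a module and then calls its _mp_fn in each spawned process, so presumably train.py needs to define one at top level - a minimal sketch, assuming the existing entry logic lives in a main() function:

# train.py (top level) - sketch assuming the existing logic is in main()
def main():
    # ...parse args, build the Trainer, call trainer.train(), etc.
    ...

def _mp_fn(index):
    # Entry point that xla_spawn/xmp.spawn invokes on each process;
    # `index` is the process ordinal on the host.
    main()

if __name__ == "__main__":
    main()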
I also tried running in single-GPU mode (p3.2xlarge) instead, calling the training script directly rather than going through the distributed launcher, and saw the error below, which seems to originate within TrainingArguments itself? Not sure why it's hitting a 'tensorflow/compiler' code path when running in PyTorch...
EDIT: It turns out the error below can be solved by explicitly setting n_gpus as mentioned in the troubleshooting doc - but that just takes me back to the error message above.
File "/opt/ml/code/code/config.py", line 124, in __post_init__
super().__post_init__()
File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 761, in __post_init__
if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 975, in device
return self._setup_devices
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1754, in __get__
cached = self.fget(obj)
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 918, in _setup_devices
device = xm.xla_device()
File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
devices = get_xla_supported_devices(
File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 137, in get_xla_supported_devices
xla_devices = _DEVICES.value
File "/opt/conda/lib/python3.8/site-packages/torch_xla/utils/utils.py", line 32, in value
self._value = self._gen_fn()
File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 19, in <lambda>
_DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:273 : Missing XLA configuration
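For the record, the single-GPU workaround from the EDIT looks roughly like this - passing n_gpus as an estimator hyperparameter is my reading of the troubleshooting doc, so treat the exact mechanism as an assumption:

import sagemaker
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# Sketch of the single-GPU run with the workaround applied. Passing n_gpus
# as a hyperparameter is an assumption based on my reading of the
# troubleshooting doc; versions and paths are illustrative.
estimator = HuggingFace(
    entry_point="train.py",  # call the training script directly, no launcher
    source_dir="./src",
    role=sagemaker.get_execution_role(),
    instance_type="ml.p3.2xlarge",  # single-GPU instance
    instance_count=1,
    transformers_version="4.11",
    pytorch_version="1.9",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),
    hyperparameters={"n_gpus": 1},  # explicit GPU count, per the doc
)
estimator.fit()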