How do I package a multi-file source directory for distributed training with SageMaker Training Compiler?

For maintainability reasons, I've split a SageMaker Training Compiler program (Hugging Face Trainer API, PyTorch) across multiple .py files. The job needs to run on multiple GPUs (although at the current scale, multi-device single-node would also be acceptable).

Following the steps in the documentation, I added the distributed_training_launcher.py launcher script to source_dir and pass the real training script through the training_script hyperparameter.
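
For context, my estimator setup looks roughly like this. This is a minimal sketch: the version pins, the "src" directory, and "train.py" are specifics of my project, not part of the documented recipe, so treat them as assumptions.

# Sketch of the estimator configuration (version pins and file names
# are assumptions from my own project, not a canonical recipe).
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="distributed_training_launcher.py",  # launcher shim from the docs
    source_dir="src",                  # the multi-file source directory
    role="<your-execution-role-arn>",
    instance_type="ml.p3.16xlarge",    # single node, multiple GPUs
    instance_count=1,
    transformers_version="4.11",
    pytorch_version="1.9",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),
    hyperparameters={
        # The launcher reads this and spawns the real script per device:
        "training_script": "train.py",
        # ...plus whatever hyperparameters train.py itself consumes.
    },
)
# estimator.fit({"train": "s3://my-bucket/train/"})  # hypothetical channel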

...but when the job tries to start, I get the following error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 90, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 86, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_gpus)
AttributeError: module 'train' has no attribute '_mp_fn'

Any idea what might be causing this? Are there particular restrictions or extra requirements for training scripts that are split across multiple files?

I also tried running in single-GPU mode (p3.2xlarge), invoking the training script directly rather than through the distributed launcher, and saw the error below, which seems to originate from TrainingArguments itself? Not sure why it's trying to call into 'tensorflow/compiler' when I'm running PyTorch..?

Edit: I've since found that the error below can be resolved by explicitly setting n_gpus as mentioned in the troubleshooting doc (see the sketch after the traceback below), but that just brings me back to the error message above.

File "/opt/ml/code/code/config.py", line 124, in __post_init__
super().__post_init__()
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 761, in __post_init__
if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 975, in device
return self._setup_devices
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1754, in __get__
cached = self.fget(obj)
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 918, in _setup_devices
device = xm.xla_device()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
devices = get_xla_supported_devices(
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 137, in get_xla_supported_devices
xla_devices = _DEVICES.value
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/utils/utils.py", line 32, in value
self._value = self._gen_fn()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 19, in <lambda>
_DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:273 : Missing XLA configuration
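
For reference, here is roughly how I ended up setting the GPU count explicitly. A minimal sketch only: I believe the troubleshooting page does this through the GPU_NUM_DEVICES environment variable, but verify the exact name against the current docs.

# Sketch: passing the device count through the job's environment.
# (GPU_NUM_DEVICES is my reading of the troubleshooting doc; verify it.)
single_gpu_env = {"GPU_NUM_DEVICES": "1"}  # ml.p3.2xlarge has one GPU

# Supplied to the estimator above via its `environment` argument, e.g.
# HuggingFace(..., environment=single_gpu_env)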
asked 8 months ago · 22 views

1 Answer

Oh, I actually solved this a while back but forgot to post an update. Yes: the training script needs to define "_mp_fn" (which can simply run the same code as the if __name__ == "__main__" block), and the number of GPUs needs to be configured explicitly (at least as of when I last checked; hopefully that changes in future). For my particular project, the fix that enabled SMTC on the existing job is available online here: https://github.com/aws-samples/amazon-textract-transformer-pipeline/pull/14/commits/45fa386faa3eee527395251449e6a58e3fb5f13c. For everyone else, I'd suggest referring to the [official SMTC example notebooks and scripts](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-examples-and-blogs.html)!
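
In case it helps anyone searching later, the shape of the fix in the training script is roughly this. A minimal sketch: main() stands in for whatever your script already does, and the _mp_fn signature follows the pattern the Hugging Face example scripts use for torch_xla.

# train.py (sketch): xla_spawn.py imports this module and calls _mp_fn
# in every spawned process, so it must exist at module top level.
def main():
    # ...build TrainingArguments / Trainer and run trainer.train()...
    pass

def _mp_fn(index):
    # Per-process entry point invoked by torch_xla's xmp.spawn(); `index`
    # is the local process ordinal, which plain Trainer scripts can ignore.
    main()

if __name__ == "__main__":
    main()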

answered 8 months ago
