How do I package a multi-file source directory for distributed training with the SageMaker Training Compiler?


For maintainability, I've split my SageMaker Training Compiler training program (Hugging Face Trainer API, PyTorch) into multiple .py files. The job needs to run on multiple GPUs (although at the current scale, multi-device single-node would also be acceptable).

Following the steps in the documentation, I added the distributed_training_launcher.py launcher script to my source_dir and passed the real training script via the training_script hyperparameter.
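For context, here is roughly how the estimator is set up (a sketch only: the paths, versions, and role are illustrative placeholders, and `training_script` is the hyperparameter the SMTC distributed-training docs describe):

```python
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="distributed_training_launcher.py",  # launcher script from the docs
    source_dir="./src",                              # multi-file source directory
    role="<your-execution-role-arn>",                # placeholder
    instance_type="ml.p3.16xlarge",                  # multi-GPU, single node
    instance_count=1,
    transformers_version="4.11",                     # illustrative; use a supported SMTC combination
    pytorch_version="1.9",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),        # enables SageMaker Training Compiler
    hyperparameters={
        "training_script": "train.py",  # the real training script inside source_dir
        # ... hyperparameters consumed by train.py itself ...
    },
)
```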

...but when the job tries to launch, I get the following error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 90, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/distributed/xla_spawn.py", line 86, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_gpus)
AttributeError: module 'train' has no attribute '_mp_fn'

Any ideas what might be causing this? Are there particular restrictions or extra requirements for training scripts that are split across multiple files?

I also tried running in single-GPU mode (p3.2xlarge), invoking the training script directly instead of via the distributed launcher, and saw the error below, which seems to originate from TrainingArguments itself? Not sure why it's trying to call into 'tensorflow/compiler' when I'm running PyTorch..?

Edit: I later found that the error below can be resolved by explicitly setting n_gpus as mentioned in the troubleshooting docs, but that just brings me back to the error message above.

File "/opt/ml/code/code/config.py", line 124, in __post_init__
super().__post_init__()
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 761, in __post_init__
if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 975, in device
return self._setup_devices
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1754, in __get__
cached = self.fget(obj)
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1764, in wrapper
return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 918, in _setup_devices
device = xm.xla_device()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
devices = get_xla_supported_devices(
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 137, in get_xla_supported_devices
xla_devices = _DEVICES.value
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/utils/utils.py", line 32, in value
self._value = self._gen_fn()
  File "/opt/conda/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 19, in <lambda>
_DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:273 : Missing XLA configuration
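In case it helps others, a rough sketch of the explicit GPU-count workaround mentioned in my edit above. I believe the troubleshooting guidance amounts to setting the GPU_NUM_DEVICES environment variable on the estimator, but treat the variable name and values here as assumptions to verify against the current docs:

```python
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# Sketch: resolving "Missing XLA configuration" by telling XLA how many
# GPUs are present. GPU_NUM_DEVICES reflects my reading of the SMTC
# troubleshooting docs; versions and paths are illustrative placeholders.
estimator = HuggingFace(
    entry_point="train.py",            # calling the training script directly (single GPU)
    source_dir="./src",                # illustrative path
    role="<your-execution-role-arn>",  # placeholder
    instance_type="ml.p3.2xlarge",     # single-GPU instance from the question
    instance_count=1,
    transformers_version="4.11",       # illustrative; use a supported SMTC combination
    pytorch_version="1.9",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),
    environment={"GPU_NUM_DEVICES": "1"},  # p3.2xlarge has a single GPU
)
```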
Asked 7 months ago · 19 views
1 Answer

Oh, I actually solved this a while ago but forgot to post an update. Yes: the training script needs to define `_mp_fn` (which can just run the same code as your `if __name__ == "__main__"` block), and the number of GPUs needs to be configured explicitly (at least as of when I last checked; hopefully that can change in future). For my particular project, the fix that enabled SMTC on the existing job is available online here: https://github.com/aws-samples/amazon-textract-transformer-pipeline/pull/14/commits/45fa386faa3eee527395251449e6a58e3fb5f13c. For others, I'd suggest referring to the [official SMTC example notebooks and scripts](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-examples-and-blogs.html)!
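For anyone hitting the same AttributeError, a minimal sketch of the pattern the launcher expects (only the `_mp_fn` name is required by torch_xla's xla_spawn.py; `main()` is an illustrative helper):

```python
# train.py - minimal structure for use with torch_xla's xla_spawn.py launcher.

def main():
    # Build your TrainingArguments/Trainer and run training here, exactly
    # as you would under `if __name__ == "__main__":`.
    ...

def _mp_fn(index):
    # xla_spawn.py imports this module and calls xmp.spawn(mod._mp_fn, ...),
    # so each spawned worker process enters here; `index` is the local
    # process index, which you can ignore if main() doesn't need it.
    main()

if __name__ == "__main__":
    # Keeps the script runnable directly (e.g. single-GPU, no launcher).
    main()
```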

Answered 7 months ago
