I am using a SageMaker notebook to train an ML model. After I created and trained the estimator successfully with the following script, I could load the path to the debugging information (s3_output_path) as expected:
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_configs

# Built-in Debugger rules evaluated against the training job
rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
]

# Save the loss tensor every 100 training steps and every 10 evaluation steps
collection_configs = [
    CollectionConfig(
        name="CrossEntropyLoss_output_0",
        parameters={
            "include_regex": "CrossEntropyLoss_output_0",
            "train.save_interval": "100",
            "eval.save_interval": "10",
        },
    )
]
debugger_config = DebuggerHookConfig(collection_configs=collection_configs)

# hyperparameters and inputs are defined elsewhere (not shown)
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    #instance_type="ml.g4dn.2xlarge",
    entry_point="train.py",
    framework_version="1.8",
    py_version="py36",
    hyperparameters=hyperparameters,
    debugger_hook_config=debugger_config,
    rules=rules,
)
estimator.fit({"training": inputs})
s3_output_path = estimator.latest_job_debugger_artifacts_path()
After the kernel died, I attached the estimator and tried to access the debugging information of the training:
import sagemaker
estimator = sagemaker.estimator.Estimator.attach('pytorch-training-2022-06-07-11-07-09-804')
s3_output_path = estimator.latest_job_debugger_artifacts_path()
rules_path = estimator.debugger_rules
Both values were None (latest_job_debugger_artifacts_path() is a method, debugger_rules is an attribute). Could this be a problem with the attach function? And how can I access the debugger's training information after the kernel has been shut down?
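
One possible way to recover the path without relying on attach, sketched here as an untested idea rather than a confirmed fix: read the job description through boto3's describe_training_job and rebuild the path from its DebugHookConfig entry. The helper names debugger_artifacts_path and describe_job are hypothetical, and the "<S3OutputPath>/<job name>/debug-output" layout is an assumption about how the SDK composes the path, so it is worth checking against the actual bucket contents.

```python
def debugger_artifacts_path(job_description):
    """Rebuild the debugger output path from a DescribeTrainingJob response.

    Assumption: artifacts live under <S3OutputPath>/<job name>/debug-output.
    Returns None when the job was run without a debugger hook config.
    """
    hook_config = job_description.get("DebugHookConfig")
    if not hook_config:
        return None
    base = hook_config["S3OutputPath"].rstrip("/")
    return f"{base}/{job_description['TrainingJobName']}/debug-output"


def describe_job(job_name):
    """Fetch the training job description (hypothetical helper; needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without AWS access

    return boto3.client("sagemaker").describe_training_job(TrainingJobName=job_name)


# Usage (requires AWS access, so it is left commented out):
# description = describe_job("pytorch-training-2022-06-07-11-07-09-804")
# s3_output_path = debugger_artifacts_path(description)
# rule_statuses = description.get("DebugRuleEvaluationStatuses", [])
```

If this works, the rule results would come from the DebugRuleEvaluationStatuses entry of the same response instead of from estimator.debugger_rules.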