I want to create a training job on SageMaker and associate both performance metrics and a model artifact with it. However, I have two problems with this:
- In the SageMaker "Experiments" section, I see that two runs are created for a single run of the notebook. One contains the performance metrics (this is the run I created manually); the other contains the artifact (this run is created automatically by the training job).
- I tried to circumvent this problem by explicitly attaching the artifact file to the "manual" run through run.log_file(file_path=filepath, name="model"). This should upload the file to an S3 bucket and attach it to the run. However, I get the following error, indicating that the S3 bucket is not accessible:
  botocore.exceptions.ClientError: An error occurred (404) when calling the HeadBucket operation: Not Found
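To narrow down the error above, it can help to check whether the bucket is missing or merely forbidden. S3's HeadBucket reports a bucket that does not exist as 404 and one the caller may not access as 403, so the 404 here may mean the session resolves a bucket name that does not exist rather than lacking permissions. The following is a minimal sketch, assuming a client that behaves like boto3.client("s3") is passed in; the helper name classify_bucket_access is hypothetical and not part of boto3 or the SageMaker SDK.

```python
def classify_bucket_access(s3_client, bucket_name):
    """Return 'ok', 'not_found', or 'forbidden' for a HeadBucket call.

    s3_client is assumed to behave like boto3.client("s3").
    """
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        return "ok"
    except Exception as exc:
        # botocore's ClientError exposes the service error in exc.response.
        error = getattr(exc, "response", {}).get("Error", {})
        code = str(error.get("Code", ""))
        if code in ("404", "NoSuchBucket"):
            return "not_found"   # the bucket name does not exist
        if code in ("403", "AccessDenied"):
            return "forbidden"   # the bucket exists, but access is denied
        raise  # anything else is unexpected; surface it unchanged
```

Inside the training script it could be called as classify_bucket_access(boto3.client("s3"), bucket_name), where bucket_name is whichever bucket the session is expected to use.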
My questions:
- How can I avoid the creation of two runs in the first place, so that I have one run with both the metrics and the artifact attached?
- Where can I change the settings for my training job so that it has access to the S3 bucket and can upload the artifact file?
Here is my code in a shortened version:
1. my notebook
import sagemaker
from sagemaker.experiments.run import Run
from sagemaker.sklearn.estimator import SKLearn

# Session that picks up the SageMaker defaults config
sm_session = sagemaker.Session(
    sagemaker_config=sagemaker.config.config.load_sagemaker_config()
)
...
with Run(
    experiment_name=experiment_name,
    run_name=run_name,
    run_display_name=run_name,
    sagemaker_session=sm_session,
) as run:
    # Reuse the manual run's experiment config for the training job
    experiment_config = run.experiment_config
    experiment_config.update({
        "TrialName": run_name,
        "TrialComponentDisplayName": run_name,
    })
    estimator = SKLearn(
        source_dir="training_job",
        entry_point="train.py",
        dependencies=["."],
        framework_version="1.2-1",
        instance_type="ml.m5.large",
        disable_output_compression=True,
        sagemaker_session=sm_session,
        experiment_config=experiment_config,
    )
2. train.py
import os

import boto3
from sagemaker.experiments.run import load_run
from sagemaker.session import Session

boto_session = boto3.session.Session(region_name=os.environ["AWS_REGION"])
sagemaker_session = Session(boto_session=boto_session)

if __name__ == "__main__":
    ...
    filepath = f"{os.environ['SM_MODEL_DIR']}/{args.model_name}.joblib"
    with load_run(sagemaker_session=sagemaker_session) as run:
        run.log_metric(
            f"validation_{metric}", fold_metric
        )
        # The following produces the ClientError
        run.log_file(file_path=filepath, name="model")