I want to use Model Monitor in a SageMaker pipeline to get the statistics.json and constraints.json results


Summary.

  • I would like to know what CSV format to use to get AWS SageMaker model quality metrics (regression metrics).

Contents

  • Model Monitor in SageMaker is used to evaluate models (our own models, via batch processing). My understanding is that, using this AWS mechanism, I can obtain statistics.json and constraints.json as the result.
    In the Model Monitor mechanism, the "problem_type" parameter in the script below takes one of three values: 'Regression', 'BinaryClassification', and 'MulticlassClassification'. I do not know what CSV format to send the content in for 'Regression', and I am not sure whether AWS will handle it appropriately. That is my question.
    Incidentally, in the classification case, 'BinaryClassification' and 'MulticlassClassification' produce results when an appropriate CSV is created.

For classification, the CSV is created as below.

probability,prediction,label
"[0.99,0.88]",0,0
"[0.34,0.77]",1,0
...
  • I don't know what CSV content I should send to get the 'Regression' results.
    Below are excerpts from pipeline.py and the CSV content we are currently testing.

Excerpt from pipeline.py:

    model_quality_check_config = ModelQualityCheckConfig(
        baseline_dataset=step_transform.properties.TransformOutput.S3OutputPath,
        dataset_format=DatasetFormat.csv(header=False),
        output_s3_uri="sagemaker/monitor/",
        problem_type="Regression",
        inference_attribute="_c1",
        ground_truth_attribute="_c0",
    )

    model_quality_check_step = QualityCheckStep(
        name="ModelQualityCheckStep",
        depends_on=["lambdacsvtransform"],
        skip_check=skip_check_model_quality,
        register_new_baseline=register_new_baseline_model_quality,
        quality_check_config=model_quality_check_config,
        check_job_config=check_job_config,
        supplied_baseline_statistics=supplied_baseline_statistics_model_quality,
        supplied_baseline_constraints=supplied_baseline_constraints_model_quality,
        model_package_group_name="group",
    )
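
For reference, the check_job_config referenced above is defined elsewhere in pipeline.py; a rough sketch of a typical definition (the role ARN and instance settings here are placeholders, not our actual values):

    from sagemaker.workflow.check_job_config import CheckJobConfig

    # Placeholder values: the role ARN and instance settings are illustrative.
    check_job_config = CheckJobConfig(
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        volume_size_in_gb=30,
    )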

CSV (Regression)

"_c0", "_c1"
0.88, 0.99
0.66, 0.87
...
asked 2 years ago · 866 views
1 Answer

Hello,

When you say you "do not know what CSV content to send to get the 'Regression' results", are you referring to the ContentType for your dataset?

Firstly, the line below means that the feature/column names in the training dataset are not provided as the first row:

dataset_format=DatasetFormat.csv(header=False)

Please see link [1] for more information on the above parameter. Basically, your dataset is a comma-separated values file, but in this particular scenario header=False because no column names are provided in the training dataset.
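
If the question is about the CSV layout itself: my understanding is that with header=False, Model Monitor assigns positional column names _c0, _c1, ... to the dataset, so ground_truth_attribute="_c0" and inference_attribute="_c1" refer to the first and second columns. Under that assumption, a regression test file would be a headerless two-column CSV with the ground-truth value first and the model prediction second. A minimal sketch of generating such a file (the file name and values are illustrative):

    import csv

    # Assumption: with DatasetFormat.csv(header=False), columns are auto-named
    # _c0, _c1, ..., so _c0 = ground truth and _c1 = prediction here.
    rows = [
        (0.88, 0.99),  # (ground truth, prediction) - illustrative values
        (0.66, 0.87),
    ]

    with open("regression_test.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)  # no header row, matching header=False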

I believe your overall question has to do with the MetricsSource [2] object that would be defined as part of the ModelMetrics module, i.e. what ContentType value should be used for your use case.

When it comes to the MetricsSource object, if you consider the example "SageMaker Pipelines integration with Model Monitor and Clarify" from https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/model-monitor-clarify-pipelines/sagemaker-pipeline-model-monitor-clarify-steps.ipynb, you will see code that looks like the following:

    from sagemaker.model_metrics import MetricsSource, ModelMetrics

    model_metrics = ModelMetrics(
        model_statistics=MetricsSource(
            s3_uri=model_quality_check_step.properties.CalculatedBaselineStatistics,
            content_type="application/json",
        ),
        model_constraints=MetricsSource(
            s3_uri=model_quality_check_step.properties.CalculatedBaselineConstraints,
            content_type="application/json",
        ),
    )

As you can see from the above, the content_type is "application/json".
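
For context, that ModelMetrics object is then typically passed to the model registration step so the baseline metrics are attached to the model package. A minimal sketch, assuming an estimator and training step along the lines of that notebook (the estimator, step_train, and step name below are illustrative, not values from your question):

    from sagemaker.workflow.step_collections import RegisterModel

    # Illustrative registration step: estimator, step_train, and the step
    # name are assumptions for the sketch.
    register_step = RegisterModel(
        name="RegisterModelStep",
        estimator=estimator,
        model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
        content_types=["text/csv"],
        response_types=["text/csv"],
        inference_instances=["ml.m5.xlarge"],
        transform_instances=["ml.m5.xlarge"],
        model_package_group_name="group",
        model_metrics=model_metrics,
    )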

The data you shared suggests that the pipeline has been deployed as an endpoint [3], as there is a deploy module that is supported.

But from running the example "SageMaker Pipelines integration with Model Monitor and Clarify", I was able to get the following files:

s3://S3-BUCKET-NAME-HERE/some-prefix/modelqualitycheckstep/statistics.json 
s3://S3-BUCKET-NAME-HERE/some-prefix/modelqualitycheckstep/constraints.json

where statistics.json contains the following:

{
  "version": 0,
  "dataset": {
    "item_count": 627,
    "evaluation_time": "2022-10-23T10:49:40.638Z"
  },
  "regression_metrics": {
    "mae": {
      "value": 1.4107242246563925,
      "standard_deviation": 0.025615074935394368
    },
    "mse": {
      "value": 3.9022604063585753,
      "standard_deviation": 0.23140761659194883
    },
    "rmse": {
      "value": 1.9754139835382798,
      "standard_deviation": 0.05901487899216817
    },
    "r2": {
      "value": 0.40614751710172436,
      "standard_deviation": 0.03121704707239033
    }
  }
}
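
For completeness, the companion constraints.json generated for a regression baseline follows the same metric layout, with a threshold and comparison operator per metric. As an illustration only (the threshold values below simply mirror the statistics above; your generated file will differ):

{
  "version": 0,
  "regression_constraints": {
    "mae": {
      "threshold": 1.4107242246563925,
      "comparison_operator": "GreaterThanThreshold"
    },
    "mse": {
      "threshold": 3.9022604063585753,
      "comparison_operator": "GreaterThanThreshold"
    },
    "rmse": {
      "threshold": 1.9754139835382798,
      "comparison_operator": "GreaterThanThreshold"
    },
    "r2": {
      "threshold": 0.40614751710172436,
      "comparison_operator": "LessThanThreshold"
    }
  }
}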

From the above, one can see that if the MetricsSource object is declared, the metrics for a regression problem type are published.

I hope I have answered the question properly; if not, please reach out to AWS Support [4] (SageMaker), explain your issue/use case in detail, and share the relevant AWS resource names (plus CloudWatch logs).

References:

[1] https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.dataset_format.DatasetFormat

[2] https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_MetricsSource.html

[3] https://sagemaker.readthedocs.io/en/stable/api/inference/pipeline.html

[4] https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case

AWS
SUPPORT ENGINEER
answered 2 years ago
