I want to use Model Monitor in a SageMaker pipeline to get the statistics.json and constraints.json results


Summary

  • I would like to know the CSV format required to get AWS SageMaker model quality metrics (regression metrics).

Contents

  • Model Monitor in SageMaker is used to evaluate models (our own models, via batch processing). My understanding is that, using this AWS mechanism, I can obtain statistics.json and constraints.json as results.
    In the Model Monitor mechanism, the script below takes a "problem_type" parameter with three possible values: "Regression", "BinaryClassification", and "MulticlassClassification". I do not know what CSV format to send the data in for "Regression", and I am not sure whether AWS will handle it appropriately. That is my question.
    Incidentally, for classification, 'BinaryClassification' and 'MulticlassClassification' produce results once an appropriate CSV is created.

For classification, the CSV is created as follows:

probability, prediction, label
[0.99,0.88], 0, 0
[0.34,0.77], 1, 0
...
  • I don't know what kind of CSV content I should send to get the 'Regression' results.
    Here are some excerpts from pipeline.py and the CSV content we are currently testing.

Excerpt from pipeline.py:

    model_quality_check_config = ModelQualityCheckConfig(
        baseline_dataset=step_transform.properties.TransformOutput.S3OutputPath,
        dataset_format=DatasetFormat.csv(header=False),
        output_s3_uri="sagemaker/monitor/",
        problem_type='Regression',
        inference_attribute="_c1",    # column holding the model predictions
        ground_truth_attribute="_c0"  # column holding the ground-truth labels
    )

    model_quality_check_step = QualityCheckStep(
        name="ModelQualityCheckStep",
        depends_on=["lambdacsvtransform"],
        skip_check=skip_check_model_quality,
        register_new_baseline=register_new_baseline_model_quality,
        quality_check_config=model_quality_check_config,
        check_job_config=check_job_config,
        supplied_baseline_statistics=supplied_baseline_statistics_model_quality,
        supplied_baseline_constraints=supplied_baseline_constraints_model_quality,
        model_package_group_name="group"
    )
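
For reference, check_job_config is used above but not shown in the excerpt; a minimal sketch of one, assuming a standard SageMaker execution role and an existing PipelineSession (the role and instance values below are placeholder assumptions, not taken from the original pipeline), could look like this:

    # Hypothetical sketch -- role, instance type, and session are assumptions
    from sagemaker.workflow.check_job_config import CheckJobConfig

    check_job_config = CheckJobConfig(
        role=role,                           # assumed IAM role with SageMaker permissions
        instance_count=1,
        instance_type="ml.m5.xlarge",
        volume_size_in_gb=30,
        sagemaker_session=pipeline_session,  # assumed existing PipelineSession
    )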

CSV (Regression)

"_c0", "_c0"
0.88, 0.99
0.66, 0.87
...
Asked 2 years ago · 877 views
1 Answer

Hello,

When you say you "don't know what kind of CSV content to send to get the 'Regression' results", are you referring to the ContentType for your dataset?

Firstly, the line below means that the feature/column names of the training dataset are not provided in the first row:

dataset_format=DatasetFormat.csv(header=False)

Please see link [1] for more information on the above parameter. Basically, your dataset is a comma-separated values file, but in this particular scenario header=False because no column names are provided in the training dataset.
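
As a quick illustration (a minimal sketch; only the header flag differs between the two modes of this parameter):

    from sagemaker.model_monitor.dataset_format import DatasetFormat

    # header=False: columns are addressed positionally as _c0, _c1, ...
    no_header_format = DatasetFormat.csv(header=False)

    # header=True: the first CSV row is read as column names, and
    # inference_attribute / ground_truth_attribute reference those names
    with_header_format = DatasetFormat.csv(header=True)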

I believe your overall question has to do with the MetricsSource [2] object that would be defined as part of the ModelMetrics module, i.e. what ContentType value should be used for your use case.

When it comes to the MetricsSource object, if you consider the example "SageMaker Pipelines integration with Model Monitor and Clarify" from https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/model-monitor-clarify-pipelines/sagemaker-pipeline-model-monitor-clarify-steps.ipynb, you will see code that looks like the following:

    model_statistics=MetricsSource(
        s3_uri=model_quality_check_step.properties.CalculatedBaselineStatistics,
        content_type="application/json",
    ),
    model_constraints=MetricsSource(
        s3_uri=model_quality_check_step.properties.CalculatedBaselineConstraints,
        content_type="application/json",
    ),
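
For completeness, in that notebook these MetricsSource entries are wrapped in a ModelMetrics object that is passed along when the model is registered; a minimal sketch of that wrapper (assuming model_quality_check_step is the QualityCheckStep defined earlier in the pipeline) would be:

    from sagemaker.model_metrics import MetricsSource, ModelMetrics

    # Sketch only: model_quality_check_step is assumed to be the
    # QualityCheckStep instance defined in the pipeline above
    model_metrics = ModelMetrics(
        model_statistics=MetricsSource(
            s3_uri=model_quality_check_step.properties.CalculatedBaselineStatistics,
            content_type="application/json",
        ),
        model_constraints=MetricsSource(
            s3_uri=model_quality_check_step.properties.CalculatedBaselineConstraints,
            content_type="application/json",
        ),
    )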

As you can see from the above, content_type is "application/json".

The data you shared suggests that the pipeline has been deployed as an endpoint [3], as there is a supported deploy module.

But from running the example "SageMaker Pipelines integration with Model Monitor and Clarify", I was able to get the following files:

s3://S3-BUCKET-NAME-HERE/some-prefix/modelqualitycheckstep/statistics.json 
s3://S3-BUCKET-NAME-HERE/some-prefix/modelqualitycheckstep/constraints.json

where statistics.json contains the following:

{
  "version": 0,
  "dataset": {
    "item_count": 627,
    "evaluation_time": "2022-10-23T10:49:40.638Z"
  },
  "regression_metrics": {
    "mae": {
      "value": 1.4107242246563925,
      "standard_deviation": 0.025615074935394368
    },
    "mse": {
      "value": 3.9022604063585753,
      "standard_deviation": 0.23140761659194883
    },
    "rmse": {
      "value": 1.9754139835382798,
      "standard_deviation": 0.05901487899216817
    },
    "r2": {
      "value": 0.40614751710172436,
      "standard_deviation": 0.03121704707239033
    }
  }
}

From the above, one can see that if the MetricsSource object is declared, then the metrics are published for a regression problem type.
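
If you want to read those regression metrics back programmatically once the files exist, a minimal sketch (the bucket name and key below are placeholders matching the paths above, not real values) could be:

    import json

    import boto3

    # Placeholder bucket/key -- substitute the output_s3_uri used by your
    # quality check step
    s3 = boto3.client("s3")
    obj = s3.get_object(
        Bucket="S3-BUCKET-NAME-HERE",
        Key="some-prefix/modelqualitycheckstep/statistics.json",
    )
    statistics = json.loads(obj["Body"].read())

    # e.g. inspect the mean absolute error of the regression baseline
    print(statistics["regression_metrics"]["mae"]["value"])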

Hope I answered the question properly; if not, please reach out to AWS Support [4] (SageMaker), explain your issue/use case in detail, and share the relevant AWS resource names (plus CloudWatch logs).

References:

[1] https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.dataset_format.DatasetFormat

[2] https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_MetricsSource.html

[3] https://sagemaker.readthedocs.io/en/stable/api/inference/pipeline.html

[4] https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case

AWS
Support Engineer
Answered 2 years ago
