SageMaker Model Monitor Data Quality on Random Cut Forest configuration issues
I am trying to set up a Random Cut Forest model with a Data Quality job attached. I managed to train and deploy the model with the "data_capture" feature enabled.
# Training
rcf = sagemaker.RandomCutForest(
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    data_location=f"s3://{BUCKET}/random_cut_forest/input",
    output_path=f's3://{BUCKET}/random_cut_forest/output',
    num_samples_per_tree=1024,
    num_trees=50,
    serializer=JSONSerializer(),
    deserializer=CSVDeserializer()
)

rs = rcf.record_set(df_multi_measurements.drop("datetime", axis=1).to_numpy())
rcf.fit(rs, wait=False)
# Deploy
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=s3_capture_upload_path
)

rcf_inference = rcf.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    endpoint_name=ENDPOINT_NAME,
    data_capture_config=data_capture_config,
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer(),
)
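For context, this is roughly how the endpoint is invoked at this stage (a sketch; the payload and results names are illustrative). CSVSerializer sends the rows as text/csv and JSONDeserializer asks for application/json back, which explains the mixed encodings captured below.

# Invoke the endpoint (sketch)
payload = df_multi_measurements.drop("datetime", axis=1).to_numpy()
results = rcf_inference.predict(payload)  # -> {"scores": [{"score": ...}, ...]}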
Then I configured and started the Model Monitor job:
# Model Monitor
my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    volume_size_in_gb=5,
    max_runtime_in_seconds=3600
)

my_default_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri + "/df_multi_measurements.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_results_uri,
    wait=True,
    logs=False
)

my_default_monitor.create_monitoring_schedule(
    monitor_schedule_name=mon_schedule_name,
    endpoint_input=rcf_inference.endpoint,
    output_s3_uri=s3_report_path,
    statistics=my_default_monitor.baseline_statistics(),
    constraints=my_default_monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
But at the first run of the job I got this error:
Error: Encoding mismatch: Encoding is CSV for endpointInput, but Encoding is JSON for endpointOutput. We currently only support the same type of input and output encoding at the moment.
The captured data looked like this:
{"captureData":{"endpointInput":{"observedContentType":"text/csv","mode":"INPUT","data":"4.150000013333333,3.330000003333333,...","encoding":"CSV"},"endpointOutput":{"observedContentType":"application/json","mode":"OUTPUT","data":"{\"scores\":[{\"score\":0.5794829282}]}","encoding":"JSON"}},"eventMetadata":{"eventId":"79add993-68cf-4903-9dfe-8275d164496f","inferenceTime":"2023-03-17T14:10:08Z"},"eventVersion":"0"}
...
So I then tried to force both input and output to be CSV, but with no luck.
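The CSV-only attempt looked roughly like this (a sketch; CSVDeserializer should make the predictor ask for text/csv back via the Accept header), but it did not solve the problem:

from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

# Attempt: request CSV on both sides of the call
rcf_inference.serializer = CSVSerializer()
rcf_inference.deserializer = CSVDeserializer()
results = rcf_inference.predict(payload)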
After some tuning, I managed to get the requests captured in JSON, so, since I couldn't change the output format, DataCapture now has both input and output in the same (JSON) form.
The JSON requests look like this:
{
    "instances": [
        {
            "features": [3.8600000533333336, 3.5966666533333336...]
        },
        ...
    ]
}
and the model works correctly, returning its predictions:
b'{"scores":[{"score":0.6015237349},...]}'
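For completeness, this is roughly how the endpoint is called now (a sketch; the payload construction is illustrative, the request shape is the one shown above):

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Send and receive application/json so both sides of the capture share one encoding
rcf_inference.serializer = JSONSerializer()
rcf_inference.deserializer = JSONDeserializer()

payload = {
    "instances": [
        {"features": row.tolist()}
        for row in df_multi_measurements.drop("datetime", axis=1).to_numpy()
    ]
}
results = rcf_inference.predict(payload)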
The captured data now looks like this:
{"captureData":{"endpointInput":{"observedContentType":"application/json","mode":"INPUT","data":"{\"instances\": [{\"features\": [3.8600000533333336, 3.5966666533333336, ...]}]}","encoding":"JSON"},"endpointOutput":{"observedContentType":"application/json","mode":"OUTPUT","data":"{\"scores\":[{\"score\":0.6015237349},{\"score\":0.4439660733},{\"score\":0.5100689867},{\"score\":0.5456048291},{\"score\":0.5099260466}]}","encoding":"JSON"}},"eventMetadata":{"eventId":"27e2c9cd-3301-419c-8d06-9ede4c6380e6","inferenceTime":"2023-03-17T17:10:18Z"},"eventVersion":"0"}
BUT... at the first run with this new configuration, the job fails with an error in the data analysis part.
So, after some searching, I found that Model Monitor only works with tabular data or flat JSON, so I added a preprocessing step to the Model Monitor job (https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-pre-and-post-processing.html).
The preprocessing script looks like this:
import json

# Shape of the captured request this handler receives:
# {
#     "instances": [
#         {"features": [3.8600000533333336, 3.5966666533333336...]},
#         ...
#     ]
# }

def preprocess_handler(inference_record):
    # Raw JSON string captured for the request
    input_record = inference_record.endpoint_input.data
    print(input_record)
    input_record_dict = json.loads(input_record)
    # Take the feature vector of the first instance in the request
    features = input_record_dict["instances"][0]["features"]
    # Flatten it into a {column_name: value} dict so Model Monitor sees tabular data
    return {str(i).zfill(20): d for i, d in enumerate(features)}
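To check that the flattening itself works, the handler can be run locally against a fake record (a hypothetical stub; in the real job the inference_record object is built by Model Monitor from the capture line):

import json
from types import SimpleNamespace

captured_input = json.dumps({"instances": [{"features": [3.86, 3.59, 4.15]}]})
fake_record = SimpleNamespace(endpoint_input=SimpleNamespace(data=captured_input))

print(preprocess_handler(fake_record))
# {'00000000000000000000': 3.86, '00000000000000000001': 3.59, '00000000000000000002': 4.15}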
And now, at the first run, I again get an error, and this time it is absolutely not understandable:
2023-03-17 18:08:46,326 ERROR Main: No usable value for features
2023-03-17T19:08:46.935+01:00 No usable value for completeness
2023-03-17T19:08:46.935+01:00 Did not find value which can be converted into double
At this stage I feel a bit stuck. How can this be fixed? RCF and Model Monitor should be easier to integrate, in my opinion.
What am I doing wrong?