SageMaker preprocessing script output order not preserved

I have created a data monitoring schedule using DefaultModelMonitor. In the monitoring schedule call I'm passing the constraints and statistics that were generated by the baseline job. I'm also using a preprocessing handler to transform the input. The order of the features in my constraints.json file is as follows:

"features" : [ {
    "name" : "most_recent_enrollment",
    "inferred_type" : "String",
    "completeness" : 1.0
  }, {
    "name" : "previous_enrollment",
    "inferred_type" : "String",
    "completeness" : 0.5401620734511657
  }, {
    "name" : "most_recent_click",
    "inferred_type" : "String",
    "completeness" : 0.6944516790395606
  }, {
    "name" : "locale",
    "inferred_type" : "String",
    "completeness" : 0.8397031338877001,
    "string_constraints" : {
      "domains" : [ "zh-TW", "hy", "es-419", "ko_KR", "bs", "fr_FR", "hr", "ta", "ka", "ar", "fr", "is", "lv", "uk_UA", "eu", "am", "bn", "uz", "uk", "fr_CA", "NullValue", "si", "ky", "pa", "ga", "pt-PT", "zh_CN", "cs", "lo", "gl", "sr", "zh-CN", "it_IT", "el", "it", "hu_HU", "ca", "pt-BR", "vi", "nl", "bg", "hi_IN", "ko", "th_TH", "or", "sv_SE", "mk", "et", "en_US", "de", "ur", "kk_KZ", "ru", "ml", "th", "id", "sq", "sr-Latn", "sv", "tr", "da", "my", "de_DE", "fil", "en", "gu", "he", "fr-CA", "kn", "sk", "ar_SA", "zh-HK", "en-US", "pt_BR", "tr_TR", "az", "es", "hi", "te", "mr", "sw", "be", "zh_TW", "en-GB", "nl_NL", "ja", "fi", "ja_JP", "ro", "pl_PL", "ne", "lt", "no", "id_ID", "ru_RU", "es_ES", "km", "kk", "sl", "es_LA", "fa", "mn", "el_GR", "es_419", "ms", "hu", "pl" ]
    }
  }, {
    "name" : "occupation_id",
    "inferred_type" : "String",
    "completeness" : 0.20719336371168762
  }, {
    "name" : "most_recent_skill_viewed",
    "inferred_type" : "String",
    "completeness" : 0.05563079534476376
  }]

Below is the preprocess_handler I use:

def preprocess_handler(inference_record):
    input_data = inference_record.endpoint_input.data
    output_data = inference_record.endpoint_output.data

    result = {}
    result['most_recent_enrollment'] = output_data["recommendations"][0] if len(output_data["recommendations"]) > 0 else None
    result['previous_enrollment'] = input_data['inputs']['most_recent_enrollment']
    result['most_recent_click'] = input_data['inputs']['most_recent_xdp_view']
    result['locale'] = input_data['inputs']['locale']
    result['occupation_id'] = input_data['inputs']['occupation_id']
    result['most_recent_skill_viewed'] = input_data['inputs']['skill_id']
    
    return result

The issue I'm having is that after the monitoring schedule job completes, I get a constraint violation on the locale feature: "Categorical value match requirement is not met. Expected match: 100.0%. Observed: 0.0% of the values match the known values." Digging deeper into the newly generated constraints.json from the monitoring job, I see that the locale feature no longer has the same constraint as before, and the most_recent_enrollment feature now has its constraint:

"features" : [ {
    "name" : "previous_enrollment",
    "inferred_type" : "String",
    "completeness" : 1.0,
    "string_constraints" : {
      "domains" : [ "zh-TW", "es-419", "ko_KR", "fr_FR", "ka", "ar", "fr", "uk_UA", "uz", "uk", "pt", "zh_CN", "cs", "sr", "zh-CN", "it_IT", "el", "it", "hu_HU", "ca", "pt-BR", "vi", "hi_IN", "ko", "th_TH", "sv_SE", "et", "en_US", "de", "nb", "kk_KZ", "ru", "th", "id", "tr", "da", "de_DE", "fil", "en", "he", "ar_SA", "pt_BR", "tr_TR", "az", "es", "zh_TW", "en-GB", "nl_NL", "ja_JP", "ro", "pl_PL", "lt", "no", "id_ID", "ru_RU", "es_LA", "fa", "el_GR", "hu", "pl" ]
    }
  }

I printed the value of result from preprocess_handler and the values are in the correct fields, so I'm not sure how the value of one feature ends up assigned to another after preprocessing:

{'most_recent_enrollment': 'abcdx', 'previous_enrollment': 'xyze', 'most_recent_click': 'bnbf', 'locale': 'ru_RU', 'occupation_id': '   ', 'most_recent_skill_viewed': 'programming'}
1 Answer

Hello,

I understand that you created a data monitoring schedule using DefaultModelMonitor, but the feature order returned by your preprocessing handler was not preserved in the monitoring output.

The issue is likely that the feature order changes when the monitoring job generates the new constraints.json file: the order of the features in that file may not match the order in which preprocess_handler returns them.
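To confirm where the two files diverge, it can help to compare them directly. The following is a minimal sketch, not part of any SageMaker API: the helper names (feature_names, features_with_domains, compare_constraints) are hypothetical, and the two dictionaries are trimmed stand-ins for the baseline and monitoring-generated files quoted in the question.

```python
def feature_names(constraints):
    """Return feature names in the order they appear in a constraints document."""
    return [feature["name"] for feature in constraints.get("features", [])]


def features_with_domains(constraints):
    """Return names of features that carry a categorical "domains" constraint."""
    return [
        feature["name"]
        for feature in constraints.get("features", [])
        if "domains" in feature.get("string_constraints", {})
    ]


def compare_constraints(baseline, observed):
    """Summarise how the observed constraints differ from the baseline."""
    base_names = feature_names(baseline)
    seen_names = feature_names(observed)
    return {
        "same_order": base_names == seen_names,
        "missing_from_observed": [n for n in base_names if n not in seen_names],
        "baseline_domains_on": features_with_domains(baseline),
        "observed_domains_on": features_with_domains(observed),
    }


# Trimmed stand-ins for the two files in the question (domains shortened).
baseline = {
    "features": [
        {"name": "most_recent_enrollment", "inferred_type": "String"},
        {"name": "previous_enrollment", "inferred_type": "String"},
        {"name": "most_recent_click", "inferred_type": "String"},
        {
            "name": "locale",
            "inferred_type": "String",
            "string_constraints": {"domains": ["en_US", "ru_RU"]},
        },
        {"name": "occupation_id", "inferred_type": "String"},
        {"name": "most_recent_skill_viewed", "inferred_type": "String"},
    ]
}
observed = {
    "features": [
        {
            "name": "previous_enrollment",
            "inferred_type": "String",
            "string_constraints": {"domains": ["en_US", "ru_RU"]},
        }
    ]
}

report = compare_constraints(baseline, observed)
print(report)
```

With the real files, the two dictionaries would come from json.load on the downloaded baseline and monitoring-output constraints.json files. A domains constraint that shows up under a different feature name than in the baseline points to values being matched to the wrong column.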

One solution to this problem is to create the output dictionary in the preprocess_handler using an OrderedDict from the collections module. This way, the order of the features will be preserved when you return the dictionary from the preprocess_handler.

Here's how you can modify your preprocess_handler:

from collections import OrderedDict

def preprocess_handler(inference_record):
    input_data = inference_record.endpoint_input.data
    output_data = inference_record.endpoint_output.data

    result = OrderedDict()
    result['most_recent_enrollment'] = output_data["recommendations"][0] if len(output_data["recommendations"]) > 0 else None
    result['previous_enrollment'] = input_data['inputs']['most_recent_enrollment']
    result['most_recent_click'] = input_data['inputs']['most_recent_xdp_view']
    result['locale'] = input_data['inputs']['locale']
    result['occupation_id'] = input_data['inputs']['occupation_id']
    result['most_recent_skill_viewed'] = input_data['inputs']['skill_id']

    return result

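By using OrderedDict, the order of the features in the dictionary is made explicit, and the new constraints.json file should match the order in which preprocess_handler returns them. Note that on Python 3.7 and later a plain dict also preserves insertion order, so if the mismatch persists after this change, the reordering may be happening outside the handler.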
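Before redeploying, the key order the handler produces can be checked locally. This is only a sketch under stated assumptions: make_record is a hypothetical stand-in built with SimpleNamespace, not the real Model Monitor record object, and it assumes endpoint_input.data and endpoint_output.data are already-parsed dictionaries, as the handler in the question treats them. EXPECTED_ORDER is copied from the baseline constraints.json in the question.

```python
from collections import OrderedDict
from types import SimpleNamespace

# Feature order taken from the baseline constraints.json in the question.
EXPECTED_ORDER = [
    "most_recent_enrollment",
    "previous_enrollment",
    "most_recent_click",
    "locale",
    "occupation_id",
    "most_recent_skill_viewed",
]


def preprocess_handler(inference_record):
    input_data = inference_record.endpoint_input.data
    output_data = inference_record.endpoint_output.data
    recommendations = output_data["recommendations"]

    result = OrderedDict()
    result["most_recent_enrollment"] = recommendations[0] if recommendations else None
    result["previous_enrollment"] = input_data["inputs"]["most_recent_enrollment"]
    result["most_recent_click"] = input_data["inputs"]["most_recent_xdp_view"]
    result["locale"] = input_data["inputs"]["locale"]
    result["occupation_id"] = input_data["inputs"]["occupation_id"]
    result["most_recent_skill_viewed"] = input_data["inputs"]["skill_id"]
    return result


def make_record(inputs, recommendations):
    """Hypothetical stand-in for the record object passed to the handler."""
    return SimpleNamespace(
        endpoint_input=SimpleNamespace(data={"inputs": inputs}),
        endpoint_output=SimpleNamespace(data={"recommendations": recommendations}),
    )


# Sample values mirror the printed result shown in the question.
record = make_record(
    inputs={
        "most_recent_enrollment": "xyze",
        "most_recent_xdp_view": "bnbf",
        "locale": "ru_RU",
        "occupation_id": "   ",
        "skill_id": "programming",
    },
    recommendations=["abcdx"],
)

result = preprocess_handler(record)
keys_in_order = list(result.keys())
order_matches = keys_in_order == EXPECTED_ORDER
print(keys_in_order, order_matches)
```

If order_matches is true locally but the monitoring output still assigns the locale domains to another feature, that suggests the handler itself is returning the fields in the intended order.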

Please try this on your end and let us know how it goes.

If you still run into issues, please reach out to AWS Support [1] (SageMaker) with a detailed description of your issue or use case, and we would be happy to assist you further.

References:

[1] Creating support cases and case management: https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case

AWS
answered 9 days ago
