I realized this issue can be resolved by switching to the Hugging Face batch transform in SageMaker, using the following code:
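The snippets below assume the SageMaker Python SDK imports and a few variables that were not shown in the answer. Here is a minimal setup sketch; the model id, token, container version, instance settings, and S3 paths are placeholders you need to replace with your own values.

import json

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

Sagemaker_Session = sagemaker.Session()
region_name = Sagemaker_Session.boto_region_name
my_role = sagemaker.get_execution_role()          # or an explicit IAM role ARN

# Placeholder values (assumptions, adjust to your model and workload)
HF_model_name = "meta-llama/Llama-2-7b-chat-hf"   # model id from hf.co/models
HUGGING_FACE_HUB_TOKEN = "<your-hf-token>"        # needed for gated models
llm_image_uri_ver = "1.1.0"                       # TGI container version to retrieve
number_of_gpu = 1
MAX_INPUT_LENGTH = 1024
MAX_TOTAL_TOKENS = 2048
MAX_BATCH_TOTAL_TOKENS = 8192
Instance_Count, InstanceType = 1, "ml.g5.2xlarge"
Max_New_Tokens, Input_Truncation = 256, 1024
MaxPayloadInMB = 6
s3_input_data_path = "s3://<bucket>/batch-input"
s3_output_data_path = "s3://<bucket>/batch-output"
batchtransform_data_file_name = "input.jsonl"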
# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",              # "huggingface" or "lmi"
    version=llm_image_uri_ver,
    session=Sagemaker_Session,
    region=region_name,
)

# print ecr image uri
print(f"llm image uri: {llm_image}")
# Define the model and endpoint configuration parameters
config = {
    'HF_MODEL_ID': HF_model_name,                                   # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),                       # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(MAX_INPUT_LENGTH),               # max length of the input text
    'MAX_TOTAL_TOKENS': json.dumps(MAX_TOTAL_TOKENS),               # max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(MAX_BATCH_TOTAL_TOKENS),   # limits the number of tokens that can be processed in parallel during generation
    'HUGGING_FACE_HUB_TOKEN': HUGGING_FACE_HUB_TOKEN,
    'HF_MODEL_QUANTIZE': "bitsandbytes",                            # optional; remove this line to disable quantization
}

HF_MODEL_QUANTIZE (optional): enables model quantization to reduce the model size and potentially improve performance, especially inference speed. However, the lower weight precision can affect output quality for some models. Typical values are "bitsandbytes" or similar, depending on the quantization methods the container supports.

# check that the token is set
# assert config['HUGGING_FACE_HUB_TOKEN'] != HUGGING_FACE_HUB_TOKEN, "Please set your Hugging Face Hub token"
# create the HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=my_role,
    image_uri=llm_image,
    env=config,
)
# Specify the batch job hyperparameters here. If you want each example to use its own
# hyperparameters, pass hyper_params_dict as None.
hyper_params = {
    "max_new_tokens": str(Max_New_Tokens),
    "truncate": str(Input_Truncation),
    "return_full_text": str(False),
}
# hyper_params = {"batch_size": str(Batch_Size), "max_new_tokens": str(Max_New_Tokens), "truncate": str(Input_Truncation), "return_full_text": str(False)}
# hyper_params_dict = {"HYPER_PARAMS": str(hyper_params)}
# create a transformer to run the batch job
batch_job = llm_model.transformer(
    instance_count=Instance_Count,
    instance_type=InstanceType,
    # strategy="MultiRecord",           # 'MultiRecord' sends multiple records in a single batch request and may be faster
    strategy='SingleRecord',            # determines how records are batched into each prediction request; some use cases require 'SingleRecord'
    assemble_with="Line",
    output_path=s3_output_data_path,    # s3 path where the output is saved together with the input
    env=hyper_params,
    accept='application/json',
    # max_concurrent_transforms=MaxConcurrentTransforms,  # (int) maximum number of concurrent HTTP requests to each transform container
    max_payload=MaxPayloadInMB,         # (int) maximum payload size of a single HTTP request to the container, in MB
)
# start the batch transform job, using S3 data as input
batch_job.transform(
    data=f"{s3_input_data_path}/{batchtransform_data_file_name}",
    content_type='application/json',
    split_type='Line',
    input_filter="$",                           # pass each input record through unchanged
    output_filter="$['id','SageMakerOutput']",  # select "id" from the input and the model output
    join_source='Input',                        # 'Input' includes the input data in the output; None omits it
    wait=True,
)
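For reference, here is a hedged sketch of how the input file for this job could be prepared and uploaded. The "id"/"inputs" record layout is an assumption based on the TGI payload format and the output_filter above, not something stated in the original answer.

import json

import boto3

# Hypothetical example records: "inputs" is the prompt field expected by the
# TGI container, "id" is the field selected by output_filter above.
records = [
    {"id": 1, "inputs": "Summarize: SageMaker batch transform runs offline inference over a dataset in S3."},
    {"id": 2, "inputs": "Translate to German: Good morning"},
]

# Write one JSON object per line, matching split_type='Line' and
# content_type='application/json' in the transform call.
with open(batchtransform_data_file_name, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Upload to the S3 input location used above (bucket/prefix parsing is simplified).
bucket, _, prefix = s3_input_data_path.replace("s3://", "").partition("/")
boto3.client("s3").upload_file(
    batchtransform_data_file_name,
    bucket,
    f"{prefix}/{batchtransform_data_file_name}",
)

After the job finishes, the results appear under s3_output_data_path in an object named like the input file with a ".out" suffix, one JSON object per line.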