Large language models (LLMs) can generate biased, toxic, or hallucinated content, which makes model evaluation crucial at every stage of development. With hundreds of LLMs available, understanding the relevant metrics and conducting evaluations is time-consuming, and an effective evaluation process streamlines selecting the right LLM for your generative AI use case.
As you embark on a project involving a foundation model, it is advisable to gather data that supports your choice of that specific model. You may want to assess the accuracy of its responses, determine whether it generates toxic or stereotypical content, and evaluate the extent to which its outputs are factual or prone to hallucination. Amazon SageMaker Clarify offers a solution for foundation model evaluations, enabling data scientists and machine learning engineers to efficiently evaluate, compare, and select foundation models based on various criteria across different tasks within minutes. It produces a comprehensive report that lets you assess a foundation model across multiple dimensions and make an informed decision.
What if you are using an LLM that is not a JumpStart or Amazon Bedrock model?
You can customize your model evaluation to support a model that is not a JumpStart or Amazon Bedrock model, or use a custom workflow for evaluation. Use fmeval, a library for evaluating large language models (LLMs), to help select the best LLM for your use case.
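As a rough sketch of what that customization can look like, the snippet below wraps a generic HTTP-served LLM in fmeval's ModelRunner interface so that a model outside of JumpStart or Amazon Bedrock can be plugged into an evaluation. The endpoint URL, request payload, and response fields are hypothetical placeholders, and the interface shown follows the fmeval documentation at the time of writing, so check it against the version you install.

```python
# Minimal sketch of a custom ModelRunner for fmeval (pip install fmeval).
# The endpoint URL, request payload, and response fields are hypothetical
# placeholders -- adapt them to your own LLM's API.
from typing import Optional, Tuple

import requests

from fmeval.model_runners.model_runner import ModelRunner


class CustomLLMRunner(ModelRunner):
    """Wraps an arbitrary (non-JumpStart, non-Bedrock) LLM behind an HTTP endpoint."""

    def __init__(self, endpoint_url: str):
        self.endpoint_url = endpoint_url

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # fmeval expects a (generated_text, log_probability) tuple;
        # return None for the log probability if your API does not expose it.
        response = requests.post(self.endpoint_url, json={"prompt": prompt}, timeout=60)
        response.raise_for_status()
        return response.json().get("generated_text"), None


model_runner = CustomLLMRunner(endpoint_url="https://example.com/my-llm/invoke")
```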
When should you create an automated model evaluation job using the fmeval library?
- If you are looking for fine-grained control over your model evaluation jobs.
- If you are looking to evaluate LLMs outside of AWS, or non-JumpStart models from other services.
In summary, if you want to evaluate an LLM, Amazon SageMaker provides the following three options to choose from:
- Set up manual evaluations for a human workforce using Studio.
- Evaluate your model with an algorithm using Studio.
- Automatically evaluate your model with a customized workflow using the fmeval library, as sketched below.
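To illustrate the third option, here is a hedged sketch of an automated evaluation run with one of fmeval's built-in algorithms (toxicity is used as an example). It assumes the CustomLLMRunner defined earlier, and the class and parameter names follow the fmeval documentation at the time of writing, so verify them against your installed version.

```python
# Sketch: automated evaluation with a built-in fmeval algorithm (toxicity).
# Assumes the `model_runner` (CustomLLMRunner) defined earlier in this article.
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

eval_algo = Toxicity(ToxicityConfig())

# With no dataset_config, the algorithm runs against fmeval's built-in
# datasets for this task; save=True writes per-record results to disk.
eval_outputs = eval_algo.evaluate(
    model=model_runner,
    prompt_template="$model_input",  # how each raw input is wrapped into a prompt
    save=True,
)

# Each EvalOutput carries aggregate scores per dataset.
for eval_output in eval_outputs:
    for score in eval_output.dataset_scores or []:
        print(eval_output.dataset_name, score.name, score.value)
```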
As you work on generative AI use cases, you may need to select and customize foundation models to power your applications. Evaluating and comparing these models during selection and customization can take days: identifying relevant benchmarks, configuring evaluation tools, and running evaluations on each model. Even then, the results are often hard to apply to your specific use case.
SageMaker Clarify offers automated and human evaluations with interpretable results. You can use this capability in Amazon SageMaker Studio to evaluate SageMaker-hosted LLMs, or use fmeval to evaluate any LLM. Get started with curated prompt datasets tailored for tasks such as text generation, summarization, question answering, and classification. Customize inference parameters and prompt templates, and compare the results of different model settings. Extend evaluations with custom prompt datasets and metrics. Human evaluations enable you to assess more subjective aspects such as creativity and style. Following each evaluation, you receive a comprehensive report, complete with visualizations and examples, which you can integrate into your SageMaker ML workflows. This blog will help you get started with evaluating foundation models (FMs) with SageMaker Clarify.
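As a sketch of extending an evaluation with a custom prompt dataset, the snippet below points fmeval's built-in factual-knowledge evaluation at a JSON Lines file and reuses the model_runner defined earlier. The file name and its question/answer fields are hypothetical, and the classes shown follow the fmeval documentation at the time of writing.

```python
# Sketch: extending an evaluation with a custom prompt dataset.
# "my_prompts.jsonl" and its "question"/"answer" fields are hypothetical;
# replace them with your own dataset and column names.
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.factual_knowledge import (
    FactualKnowledge,
    FactualKnowledgeConfig,
)

data_config = DataConfig(
    dataset_name="my_custom_prompts",
    dataset_uri="my_prompts.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",   # JSON field holding the prompt
    target_output_location="answer",   # JSON field holding the expected answer
)

eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
eval_outputs = eval_algo.evaluate(
    model=model_runner,                # the custom runner defined earlier
    dataset_config=data_config,
    prompt_template="Answer the following question: $model_input",
    save=True,
)
```

The target_output_delimiter lets a single record list multiple acceptable answers in the target field, separated by "<OR>".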
In this article, we explored ways to evaluate any large language model (LLM), including non-AWS models, using Amazon SageMaker Clarify and the fmeval library for automated or customized evaluations.
Note: Besides Amazon SageMaker, you can also use Amazon Bedrock's model evaluation capability to evaluate, compare, and select the best foundation models for your use cases. We will explore this alternative in a future article.