Prompts in Bedrock evaluation


In a RAG evaluation job in Bedrock there are multiple metrics, and since an LLM is used as the judge I assume there will be prompts for each metric. Say I pick 3 metrics: faithfulness, correctness, and completion. Will there be 3 different prompt calls to the LLM, or just a single LLM call in which the prompts for all 3 metrics are compacted into one?

asked a month ago · 68 views
2 Answers

In Amazon Bedrock's evaluation process, the approach to handling multiple metrics like faithfulness, correctness, and completion typically involves a single, consolidated prompt rather than separate calls for each metric.

When you set up an automatic model evaluation job in Amazon Bedrock, you select a task type and are provided with recommended metrics. For these metrics, Amazon Bedrock uses a method known as "LLM-as-a-judge," where a language model evaluates the outputs based on predefined criteria.

The evaluation process is designed to be efficient and streamlined. Instead of making separate LLM calls for each metric, Amazon Bedrock combines the evaluation criteria for multiple metrics into a single, comprehensive prompt. This consolidated prompt includes all the necessary information, instructions, and evaluation criteria for the different metrics you've selected.
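To make the idea concrete, here is a purely illustrative sketch of what a consolidated judge prompt could look like. This is not Amazon Bedrock's actual evaluation template (which is not published in this form), and the metric wording is only an assumption; it simply shows one prompt carrying the rubric for all three metrics and asking for a single structured verdict:

```python
# Illustrative only: a hand-rolled "all metrics in one call" judge prompt.
# Bedrock's real evaluation templates and metric definitions may differ.
COMBINED_JUDGE_PROMPT = """You are evaluating a RAG system's answer.

Question: {question}
Retrieved context: {context}
Generated answer: {answer}
Reference answer: {reference}

Rate the generated answer on each criterion from 0 to 1:
- faithfulness: is every claim supported by the retrieved context?
- correctness: does the answer agree with the reference answer?
- completeness: does the answer address all parts of the question?

Respond only with JSON:
{{"faithfulness": <score>, "correctness": <score>, "completeness": <score>}}
"""
```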

This approach has several advantages:

  1. Efficiency: It reduces the number of API calls, which can save time and potentially reduce costs.
  2. Consistency: By evaluating all metrics in a single context, the LLM can provide more consistent judgments across different aspects of the response.
  3. Holistic evaluation: The LLM can consider all aspects simultaneously, potentially leading to a more nuanced and comprehensive evaluation.

When you create an evaluation job, you can use built-in prompt datasets that Amazon Bedrock provides, or you can create custom datasets. These datasets are used in conjunction with the evaluation prompt to assess the model's performance across the specified metrics.
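If you are starting the job programmatically rather than from the console, a minimal sketch looks like the following, assuming the boto3 `bedrock` client's `create_evaluation_job` operation. The field names below (built-in metric identifiers, the `ragConfigs` nesting, the evaluator model ID) are best-effort assumptions and should be checked against the current CreateEvaluationJob API reference; note that all three metrics are attached to the same job rather than one job per metric.

```python
import boto3

# Sketch only: placeholder ARNs, bucket names, and IDs throughout.
bedrock = boto3.client("bedrock")

response = bedrock.create_evaluation_job(
    jobName="rag-eval-three-metrics",
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",  # placeholder role
    applicationType="RagEvaluation",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "my-rag-dataset",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/eval/input.jsonl"},
                    },
                    # All three metrics go on the same job; names assumed here.
                    "metricNames": [
                        "Builtin.Faithfulness",
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                    ],
                }
            ],
            # The judge model used for LLM-as-a-judge scoring.
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    # Points the job at the RAG source being evaluated (e.g. a Knowledge Base);
    # the exact nesting is abbreviated and should be verified.
    inferenceConfig={
        "ragConfigs": [
            {"knowledgeBaseConfig": {"retrieveAndGenerateConfig": {"type": "KNOWLEDGE_BASE"}}}
        ]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval/output/"},
)
print(response["jobArn"])
```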

It's worth noting that while the evaluation is done in a single prompt, the results are typically broken down by individual metrics, allowing you to analyze performance across different dimensions of quality separately.
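If you want to aggregate those per-metric results yourself, a small script like the one below works, assuming the job writes one JSONL record per prompt with a per-metric score list under `automatedEvaluationResult`. The exact output schema should be confirmed against the files your job actually writes to the output S3 prefix.

```python
import json
from collections import defaultdict

# Assumed layout: one JSONL record per evaluated prompt, each carrying
# {"automatedEvaluationResult": {"scores": [{"metricName": ..., "result": ...}]}}
scores_by_metric = defaultdict(list)

with open("output_records.jsonl") as f:  # downloaded from the job's output S3 prefix
    for line in f:
        record = json.loads(line)
        for score in record.get("automatedEvaluationResult", {}).get("scores", []):
            scores_by_metric[score["metricName"]].append(score["result"])

for metric, values in scores_by_metric.items():
    print(f"{metric}: mean={sum(values) / len(values):.3f} over {len(values)} records")
```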

This consolidated approach to prompt evaluation aligns with Amazon Bedrock's overall design philosophy of providing efficient, scalable, and comprehensive tools for managing and optimizing AI models and their outputs.

Sources
Evaluating prompts at scale with Prompt Management and Prompt Flows for Amazon Bedrock | AWS Machine Learning Blog
Use prompt datasets for model evaluation in Amazon Bedrock - Amazon Bedrock
Use built-in prompt datasets for automatic model evaluation in Amazon Bedrock in SageMaker Unified Studio - Amazon SageMaker Unified Studio
Prompt Management for Amazon Bedrock - AWS

answered a month ago

I understand you’re looking for clarity on how evaluation metrics work with Amazon Bedrock—specifically whether having 3 evaluation metrics would result in 3 separate prompt calls to the LLM, or if it would be a single call with all 3 prompts combined.

To clarify, in the Amazon Bedrock model evaluation process, multiple evaluation metrics are handled through a single, consolidated prompt. This means the system does not make separate LLM calls for each metric—instead, it compacts all the prompts into one.

For more detailed information on how Amazon Bedrock model evaluation uses LLM-as-a-judge, please refer to the blog below: https://aws.amazon.com/blogs/machine-learning/llm-as-a-judge-on-amazon-bedrock-model-evaluation/

AWS
answered a month ago
