IBM watsonx.governance Evaluation Studio for Advanced Prompt Assessment

Ravi Chamarthy
7 min read · Dec 26, 2024


(Co-Authored by Ravi Chamarthy, Arunkumar K Suryanarayanan)

IBM watsonx.governance Evaluation Studio

The problem statement: Evaluating multiple LLMs and Prompts for Quality Chatbot Responses.

Consider a scenario where a bank is building a chatbot powered by a Large Language Model using Retrieval-Augmented Generation, enabling bank users to ask questions about banking products. Let’s say a team of LLM application developers is tasked with building this chatbot. The typical steps to develop the RAG-powered chatbot are as follows:

  1. Split, index, and load the documents into the vector store.
  2. Query the vector store with the user’s question to retrieve the top-k chunks/contexts.
  3. Use these contexts and the user’s question to construct a system prompt and run it against an LLM.
  4. Finally, retrieve the answer.

Something like this:

Typical RAG Pipeline
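In code, the pipeline reduces to a few small functions. The sketch below is illustrative only: vector_store.similarity_search and llm.generate stand in for whatever vector store client and model SDK the team actually uses.

# Minimal RAG sketch -- the retriever and LLM calls are placeholders, not a specific SDK.
from typing import List

def retrieve_top_k(vector_store, question: str, k: int = 3) -> List[str]:
    """Step 2: query the vector store for the k chunks most similar to the question."""
    return vector_store.similarity_search(question, k=k)  # placeholder retriever API

def build_prompt(contexts: List[str], question: str) -> str:
    """Step 3: assemble the system prompt from the retrieved contexts and the user's question."""
    context_block = "\n\n".join(contexts)
    return (
        f"Context: {context_block}\n\n"
        "Answer the following question using only information from the above context. "
        "Answer in a complete sentence. If there is no good answer in the context, "
        'say "I don\'t know".\n\n'
        f"Question: {question}\n\nAnswer:"
    )

def answer(vector_store, llm, question: str) -> str:
    """Steps 2-4: retrieve, build the prompt, and call the LLM."""
    contexts = retrieve_top_k(vector_store, question)
    prompt = build_prompt(contexts, question)
    return llm.generate(prompt)  # placeholder LLM call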

In this process, LLM application developers typically face the following challenges:

  • Selecting the LLM: Among the candidate LLMs, which one produces responses with better RAG metrics, such as Faithfulness, Answer Relevance, and Context Relevance?
  • Defining the system prompt: Does including more varied N-shot examples yield better relevance scores and fewer hallucinated answers than using only a few?
  • Tuning the prompt parameters: For instance, does a temperature of 0.6 produce better answer relevance scores and fewer hallucinations than a temperature of 0.9?

You can imagine various combinations for evaluating which LLM to use, what prompt string to employ, and which prompt parameters to set.
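To get a feel for how quickly these choices multiply, here is a rough sketch; the model IDs, prompt variants, and temperature values below are purely illustrative.

from itertools import product

# Illustrative candidates only -- substitute the models, prompts, and parameters you are considering.
models = [
    "ibm/granite-13b-chat-v2",
    "mistralai/mixtral-8x7b-instruct-v01",
    "meta-llama/llama-3-70b-instruct",
]
prompt_variants = ["zero-shot", "3-shot", "10-shot"]
temperatures = [0.6, 0.9]

combinations = list(product(models, prompt_variants, temperatures))
print(f"{len(combinations)} evaluation runs to compare")  # 3 x 3 x 2 = 18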

The obvious question is: How can LLM application developers streamline the LLM selection process, perform these evaluations, and effectively compare the quality metrics?

The Solution: IBM watsonx.governance Evaluation Studio.

IBM watsonx.governance Evaluation Studio enables LLM application developers to evaluate and compare generative AI assets using quality metrics and customizable criteria tailored to specific application needs. Developers can assess the performance of multiple assets simultaneously and compare results to identify the best LLM, prompt, or prompt parameters for their LLM-powered application.

By using Evaluation Studio, developers can accelerate the AI development process. The platform automates the evaluation of multiple AI assets for different tasks (such as Summarization, Content Generation, RAG, Text Classification, Entity Extraction, and Question Answering), removing the need to review each prompt template individually. Instead, a single AI evaluation experiment can be configured to test multiple prompt templates at once, saving valuable development time.

The Evaluation Procedure: IBM watsonx.governance Evaluation Studio.

Among the various combinations mentioned above, let’s consider a scenario where the LLM application developers have designed the following system prompt. They now want to validate it by testing the prompt against three different LLMs using validation data.

Context: {context}

Answer the following question using only information from the above context.
Answer in a complete sentence. If there is no good answer in the context,
say "I don't know".

Question: {question}

Answer:

Log in to https://dataplatform.cloud.ibm.com/ and navigate to the project of your choice. Using IBM watsonx.ai Prompt Lab in that project (one with a watsonx context), create three Prompt Template assets, each using a different LLM, as shown below.

Please note that, in practice, the system prompt would include much more sophisticated and detailed instructions with N-shot examples. However, for the purpose of this post, we are using a highly simplified prompt for demonstration.

Prompt Template Assets with different LLMs
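If you prefer to script this step rather than click through Prompt Lab, the ibm-watsonx-ai Python SDK provides a PromptTemplateManager for storing prompt templates in a project. The snippet below is a rough sketch based on that SDK; parameter names and supported model IDs vary across SDK versions, so treat it as indicative and check the SDK documentation before use.

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models.prompts import PromptTemplate, PromptTemplateManager

# Placeholder credentials and project ID -- replace with your own values.
credentials = Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<YOUR_API_KEY>")
prompt_mgr = PromptTemplateManager(credentials=credentials, project_id="<PROJECT_ID>")

template_text = (
    "Context: {context}\n\n"
    "Answer the following question using only information from the above context. "
    "Answer in a complete sentence. If there is no good answer in the context, "
    'say "I don\'t know".\n\n'
    "Question: {question}\n\nAnswer:"
)

# One Prompt Template asset per candidate LLM (model IDs are illustrative).
for model_id in [
    "ibm/granite-13b-chat-v2",
    "mistralai/mixtral-8x7b-instruct-v01",
    "meta-llama/llama-3-70b-instruct",
]:
    prompt_mgr.store_prompt(PromptTemplate(
        name=f"banking-rag-{model_id.split('/')[-1]}",
        model_id=model_id,
        input_text=template_text,
        input_variables=["context", "question"],
    ))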

You don’t need to evaluate each Prompt Template asset individually, since all the prompts will be evaluated in one go as part of the Evaluation Studio process, but the choice is yours.

The next step is to create an Evaluation Asset using the following steps:

  • In the project, click on “New asset”
  • Select “Evaluate and compare prompts”
  • Provide a name and an optional description for the experiment.
  • Select the task type for the prompt, which in this case is RAG.
  • Clicking “Next” will display the available RAG-based prompt templates from the current project.
Listing of the prompt template assets for the Evaluation Experiment.
  • Clicking “Next” again will bring you to a screen where you can configure the metric dimensions: “Generative AI Quality” and “Model Health.”
  • For “Generative AI Quality”, click “Configure” to select the metrics for evaluation.
Configure the Generative AI Quality Dimension.
  • Select all metrics except the “Data safety” metrics (for this demonstration we are not checking for HAP and PII content, since we know the validation data doesn’t contain any).
Dimensions selection for evaluation.
  • To evaluate the Answer Quality and Retrieval Quality metrics, configure an LLM-as-a-Judge model to assess Faithfulness, Context Relevance, and Answer Relevance. We are using a Mistral AI model as the judge evaluator for the RAG metrics. Click “Save”.
  • Adjust any metric thresholds as needed, then click “Save” again.
Configure Evaluations
  • As noted at the beginning of this section, the system prompt contains two variables: context and question. In the follow-up screen, select these two prompt variables.
RAG specific settings
  • Clicking “Next” will bring up a screen for test/validation data selection. For RAG, the test/validation file should contain columns for the question, the context, and the reference answer. The reference answer column is optional and is only needed if you want to compute reference-based metrics such as ROUGE and BLEU (a minimal example of such a file is sketched after this walkthrough).
Test/Validation data selection for running the Evaluation Experiment.
  • Select the data asset file from the Project Assets.
  • In the follow-up screen, select the appropriate mappings: choose the column from the data asset that corresponds to the context, the column that corresponds to the reference answer (used for the reference-based quality metrics mentioned above), and the column that corresponds to the question.
Data mapping
  • Review the configuration selections and click on “Run evaluation”
Review and Run evaluation
  • The evaluation will be triggered, and you will see the evaluation progress screen, as shown below.
Evaluation Progress
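For reference, the test/validation data is just a flat table with one row per question. Here is a minimal sketch of how such a file could be put together; the column names and rows are illustrative, and you map the columns to context, question, and reference answer in the data-mapping step.

import pandas as pd

# Illustrative validation set -- in practice this would hold the bank's curated Q&A pairs.
validation_df = pd.DataFrame({
    "question": [
        "What is the minimum balance for a savings account?",
        "Does the bank charge a fee for international wire transfers?",
    ],
    "context": [
        "Savings accounts require a minimum balance of $500 to avoid monthly fees ...",
        "International wire transfers incur a flat fee of $25 per transaction ...",
    ],
    "reference_answer": [  # optional; needed only for reference-based metrics such as ROUGE and BLEU
        "The minimum balance for a savings account is $500.",
        "Yes, the bank charges a flat fee of $25 per international wire transfer.",
    ],
})
validation_df.to_csv("banking_rag_validation.csv", index=False)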

This concludes the evaluation configuration and triggering. In the next section, we will review the evaluation output.

The Evaluation Results: IBM watsonx.governance Evaluation Studio.

Once the evaluation job is completed, the Evaluation/Metrics comparison screen will be displayed, showing the computed metrics.

Evaluation Output

The Metrics comparison screen displays the metrics for the selected metric group, with each bar in the chart representing one of the selected prompts and its LLM. For example, the red bar corresponds to the Granite model. You can observe that the Mistral model performs better on the Faithfulness and Answer Relevance metrics for the selected prompts.

You can change the Metrics group using the dropdown, as shown in the screenshot below, to view the computed metrics for the selected prompts.

Viewing metrics for different metric groups

You can pin a prompt to view the difference in metric values across different prompts, as shown in the screenshot below.

Pin and compare

Click on the icon beside the Metrics Group selection to view the “Custom Ranking” option. This allows the LLM application developer to apply weight factors to a set of metrics and see how the computed score fares across the selected prompts. For example, in the screenshot below, the Faithfulness metric is assigned a weight factor of 1, and the Answer Relevance metric is assigned a weight factor of 0.8.

Custom ranking configuration

When you click “Apply,” the Metrics comparison screen is updated with the selected metrics, and the prompts are ranked by their custom ranking score. As you can see, Mistral performs better according to the custom ranking score for the selected prompts (again, this is based only on demo/synthetic data, not on actual customer data).

Custom ranking results
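The exact aggregation formula Evaluation Studio uses isn’t spelled out here, but conceptually the custom ranking reduces to a weighted combination of the selected metrics. A back-of-the-envelope sketch, assuming a simple weighted average over metrics already on a 0-to-1 scale and using made-up metric values (not the numbers from the run above):

# Hypothetical metric values for three prompts -- purely for illustration.
weights = {"faithfulness": 1.0, "answer_relevance": 0.8}

prompt_metrics = {
    "granite-prompt": {"faithfulness": 0.78, "answer_relevance": 0.81},
    "mistral-prompt": {"faithfulness": 0.88, "answer_relevance": 0.90},
    "llama-prompt":   {"faithfulness": 0.83, "answer_relevance": 0.85},
}

def custom_score(metrics: dict, weights: dict) -> float:
    """Weighted average of the selected metrics (illustrative formula only)."""
    total_weight = sum(weights.values())
    return sum(metrics[m] * w for m, w in weights.items()) / total_weight

# Rank the prompts by their custom score, highest first.
for name, metrics in sorted(prompt_metrics.items(),
                            key=lambda kv: custom_score(kv[1], weights),
                            reverse=True):
    print(f"{name}: {custom_score(metrics, weights):.3f}")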

Well, I could go on with the features, but will pause here.

In summary, IBM watsonx.governance Evaluation Studio enables developers to configure a single evaluation experiment to test and compare multiple prompt templates across varied LLMs simultaneously, streamlining the process, reducing manual effort, and boosting development efficiency.

Give it a try today at cloud.ibm.com by provisioning watsonx.governance!


Happy Evaluating! — IBM watsonx.governance — Streamlining Generative AI evaluations for smarter decisions!!
