Using the IBM watsonx.governance-monitoring toolkit to evaluate the quality of an LLM-powered application that uses langchain

Ravi Chamarthy
4 min read · Mar 26, 2024

(Co-Authored by Ravi Chamarthy and Manish Bhide)

Visual Abstraction: Evaluate the quality of Chains

Greetings!

The Use Case

Let’s consider a scenario where we are developing an LLM-powered application using langchain, say for mobile issue summarization, further classifying the issue, and finally generating an issue resolution for that issue type.

So, in total there are 3 processing steps in the langchain, for which we would be using 3 large language models, as below (a sketch of how these chains might be wired up follows the list):

  • Issue Summarization — Azure OpenAI GPT Turbo 8K model.
  • Issue classification — IBM watsonx.ai Flan T5 XXL model.
  • Issue Resolution — IBM watsonx.ai Llama 2 13B model.
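As a rough sketch, the three steps could be wired up as individual LLMChains whose output keys feed the next step. The model classes, deployment names, prompt texts and environment variables here are illustrative assumptions, not the exact ones used in this post; credentials for Azure OpenAI and watsonx.ai are expected to come from the respective environment variables.

import os

from langchain.chains import LLMChain, SequentialChain
from langchain.prompts import PromptTemplate
from langchain_ibm import WatsonxLLM
from langchain_openai import AzureChatOpenAI

# Step 1: summarize the raw issue content (Azure OpenAI GPT Turbo 8K).
# The deployment name is a placeholder; endpoint/API key/version are read from
# AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY and OPENAI_API_VERSION.
summarization_prompt_azure_openai = LLMChain(
    llm=AzureChatOpenAI(azure_deployment="gpt-turbo-8k"),
    prompt=PromptTemplate.from_template(
        "Summarize the following mobile issue:\n{content}\nSummary:"),
    output_key="summary",
)

# Step 2: classify the summarized issue (IBM watsonx.ai Flan T5 XXL).
# The watsonx.ai API key is read from the WATSONX_APIKEY environment variable.
summarization_prompt_flan_t5 = LLMChain(
    llm=WatsonxLLM(model_id="google/flan-t5-xxl",
                   url=os.environ["WATSONX_URL"],
                   project_id=os.environ["WATSONX_PROJECT_ID"]),
    prompt=PromptTemplate.from_template(
        "Classify this issue summary into an issue type:\n{summary}\nIssue type:"),
    output_key="issue_type",
)

# Step 3: generate a resolution for the classified issue (IBM watsonx.ai Llama 2 13B).
issue_resolution_flan_t5 = LLMChain(
    llm=WatsonxLLM(model_id="meta-llama/llama-2-13b-chat",
                   url=os.environ["WATSONX_URL"],
                   project_id=os.environ["WATSONX_PROJECT_ID"]),
    prompt=PromptTemplate.from_template(
        "Suggest a resolution for a {issue_type} issue:\nResolution:"),
    output_key="resolution",
)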

The Problem

But wait: how do we know the quality of each processing step in the langchain? Quality, as in: is the generated mobile issue summary comparable to, say, a ground truth summary? Is the mobile issue resolution comparable to a ground truth resolution?

For this, the IBM watsonx.governance-monitoring SDK offers a wide range of metrics, such as ROUGE, BLEU, Text Quality, Input/Output HAP and Input/Output PII, that can be evaluated on the generated summary and generated content.
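Programmatically, the metrics are computed through an SDK client. A minimal sketch of obtaining one, assuming the ibm-watson-openscale package and IBM Cloud IAM credentials (the exact package, authenticator and service URL depend on your watsonx.governance setup); the resulting client object is the one used later inside the callback handler.

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient

# Authenticate against the watsonx.governance (Watson OpenScale) service;
# the API key and service URL below are placeholders for your own values.
authenticator = IAMAuthenticator(apikey="<CLOUD_API_KEY>")
client = APIClient(authenticator=authenticator,
                   service_url="https://aiopenscale.cloud.ibm.com")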

But, again, how do we tie all these together?

This is where callbacks from langchain come into the picture.

The Solution and Implementation

Build a chain of 3 models, where:

  • For the first model, send the issue content as the input; it outputs the issue summary.
  • For the second model, send the issue summary; it outputs the issue classification, such as BatteryPerformance, StorageDataManagement, or ConnectivityAndNetwork issues.
  • For the third model, send the issue classification; it outputs the issue resolution.

For this chain, associate a callback handler, which does the following:

  • Implement the callback event methods like “on_chain_start” and “on_chain_end”, to which the overall chain inputs and outputs are provided:
      • Issue Content
      • Issue Summary
      • Issue Classification
      • Issue Resolution
  • Also send the ground truth summary and the ground truth resolution to the callback handler.

Now we have all the necessary elements to measure the quality of the overall chain, using the IBM watsonx.governance-monitoring SDK, which can compute metrics like ROUGE, BLEU, SARI, Input HAP, Output HAP, and any other custom metrics as well.
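The callback handler below references two metric configurations, summarization_metric_config and generation_metric_config. As an illustrative sketch only (the exact configuration structure, metric-group keys and per-metric options come from the SDK documentation), they could look roughly like this:

# Illustrative placeholders: consult the watsonx.governance-monitoring SDK docs
# for the exact metric-group keys, metric names and per-metric parameters.
summarization_metric_config = {
    "configuration": {
        "summarization": {
            "rouge_score": {},
            "sari": {},
        }
    }
}

generation_metric_config = {
    "configuration": {
        "generation": {
            "rouge_score": {},
            "bleu": {},
        }
    }
}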

from typing import Any, Dict

import json
import pandas as pd
from langchain.callbacks.base import BaseCallbackHandler


class MyCustomHandler(BaseCallbackHandler):
    prompts_text = None
    summary_ground_truth = None
    resolution_ground_truth = None

    def __init__(self, summary_ground_truth: str = "", resolution_ground_truth: str = ""):
        self.summary_ground_truth = summary_ground_truth
        self.resolution_ground_truth = resolution_ground_truth
        print(self.summary_ground_truth)
        print(self.resolution_ground_truth)

    def on_chain_start(self, serialized: Dict[str, Any], prompts: Dict[str, Any], **kwargs: Any) -> Any:
        # Capture the overall chain inputs (the raw issue content).
        print('Inside on_chain_start')
        self.prompts_text = prompts
        print(self.prompts_text)

    def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> Any:
        # The overall chain outputs carry the generated summary and resolution.
        print('Inside on_chain_end')

        overall_context = self.prompts_text['content']
        generated_summary = outputs['summary']
        resolution = outputs['resolution']

        # Evaluate the generated summary against the ground truth summary.
        print('Evaluating Summarization quality..')
        df_input = pd.DataFrame({'input_text': [overall_context]})
        df_reference = pd.DataFrame({'ground_truth': [self.summary_ground_truth]})
        df_output = pd.DataFrame({'generated_summary': [generated_summary]})
        evals = client.llm_metrics.compute_metrics(summarization_metric_config,
                                                   sources=df_input,
                                                   predictions=df_output,
                                                   references=df_reference)
        print('Summarization evaluation results:')
        print(json.dumps(evals, indent=2))

        print('\n')

        # Evaluate the generated resolution against the ground truth resolution.
        print('Evaluating Content generation quality..')
        df_reference = pd.DataFrame({'ground_truth': [self.resolution_ground_truth]})
        df_output = pd.DataFrame({'resolution': [resolution]})
        evals = client.llm_metrics.compute_metrics(generation_metric_config,
                                                   sources=df_input,
                                                   predictions=df_output,
                                                   references=df_reference)
        print('Content generation evaluation results:')
        print(json.dumps(evals, indent=2))

Tying all the elements together, we create a SequentialChain with all three chains and the callback handler, as below:

chain = SequentialChain(chains=[summarization_prompt_azure_openai, summarization_prompt_flan_t5, issue_resolution_flan_t5],
                        input_variables=["content"],
                        output_variables=["summary", "issue_type", "resolution"],
                        callbacks=[MyCustomHandler(
                            summary_ground_truth='Push notifications from apps can drain battery life, users can manage notification settings to reduce battery consumption.',
                            resolution_ground_truth='Optimize background app usage to conserve battery.')],
                        verbose=True)

Now, we invoke the chain by passing a sample mobile issue.

issue = 'Apps that send push notifications at high frequencies can contribute to increased battery consumption. \
Users can manage notification settings, disable unnecessary alerts, and set longer intervals for non-essential updates.'

chain.invoke({"content" :issue})

For this issue, the chain comes into play: each model is executed, and the metrics are evaluated using the IBM watsonx.governance-monitoring SDK, with sample output as below.

Inside on_chain_start
{'content': 'Apps that send push notifications at high frequencies can contribute to increased battery consumption. Users can manage notification settings, disable unnecessary alerts, and set longer intervals for non-essential updates.'}


> Entering new SequentialChain chain...
Inside on_chain_end
Evaluating Summarization quality..

Summarization evaluation results:
{
  ...
  "rouge_score": {
    "rouge1": {
      "metric_value": 1.0
    },
    "rouge2": {
      "metric_value": 1.0
    },
    "rougeL": {
      "metric_value": 1.0
    },
    "rougeLsum": {
      "metric_value": 1.0
    }
  },
  "sari": {
    "metric_value": 99.12280701754386
  }
}


Evaluating Content generation quality..
Content generation evaluation results:
{
  ...
  "rouge_score": {
    "rouge1": {
      "metric_value": 0.7143
    },
    "rouge2": {
      "metric_value": 0.6667
    },
    "rougeL": {
      "metric_value": 0.7143
    },
    "rougeLsum": {
      "metric_value": 0.7143
    }
  }
}

> Finished chain.
{'content': 'Apps that send push notifications at high frequencies can contribute to increased battery consumption. Users can manage notification settings, disable unnecessary alerts, and set longer intervals for non-essential updates.',
'summary': 'Push notifications from apps can drain battery life, users can manage notification settings to reduce battery consumption.',
'issue_type': 'BatteryPerformance',
'resolution': 'a) Manage background app usage to conserve'}

In summary, this post describes how to build a sequential chain with multiple prompts and their models, and how to register a callback handler with the chain in which the LLM quality metrics are evaluated.


Ravi Chamarthy

Software Architect, watsonx.governance - Monitoring & IBM Master Inventor