IBM watsonx.governance in Action: Evaluating DeepSeek R1-Powered RAG Apps

Ravi Chamarthy
5 min read · Jan 31, 2025


Setting the context

With the excitement surrounding the recent release of the DeepSeek R1 model, many LLM application developers are eager to explore its capabilities. A key question arises: does the DeepSeek model generate faithful responses grounded in the provided context, or does it produce hallucinated answers? And how relevant are the model's generated responses to the questions asked?

This is where IBM watsonx.governance comes into play. It enables the evaluation of RAG-based applications powered by DeepSeek R1 by measuring key quality metrics such as faithfulness, answer relevancy, and similarity, comparing model-generated outputs against ground truth data or even answers generated by other large language models. By leveraging watsonx.governance, developers can gain deeper insight into model performance and ensure the reliability of their applications, including any proofs of concept they plan to build.

Model Installation — Local

On a RHEL VM (I used an 8-core, 16GB memory instance with multiple other applications running), install Ollama as the root user using the following command:

# curl -fsSL https://ollama.com/install.sh | sh

If Jupyter Notebooks are running under a non-root user (which is typically the case), download and run the DeepSeek R1 model as that non-root user, as shown below.

(Initially, I experimented with the 14B model, but given the resource constraints of my VM, I opted for the small 1.5B model instead.)

$ ollama run deepseek-r1:1.5b
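Before wiring the model into LangChain, it can be worth a quick sanity check that the Ollama server is up and the model has been pulled. Below is a minimal sketch, assuming Ollama is listening on its default local port 11434 (its /api/tags endpoint lists locally pulled models):

import requests

# Quick sanity check against the local Ollama REST API (default port 11434).
# /api/tags returns the models that have been pulled locally.
resp = requests.get("http://localhost:11434/api/tags", timeout=10)
resp.raise_for_status()
local_models = [m["name"] for m in resp.json().get("models", [])]
print(local_models)  # expect something like ['deepseek-r1:1.5b']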

Model generation

There are multiple ways to connect to the model via Ollama. In this case, I chose to use LangChain to interact with the DeepSeek R1 model. The main reason for this choice is that the raw output returned through LangChain includes the model's Chain-of-Thought (CoT) reasoning.

from langchain_community.llms import Ollama
llm = Ollama(model="deepseek-r1:1.5b")
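A quick smoke test (a minimal sketch; the prompt and the reply wording are made up for illustration) shows what the raw completion looks like before any post-processing:

# Smoke test: the raw completion string includes the model's reasoning
# wrapped in <think>...</think>, followed by the final answer.
raw = llm.invoke("In one sentence, what is a health insurance deductible?")
print(raw)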

As you may have already noticed by now, the DeepSeek model’s response includes CoT reasoning enclosed within <think> and </think> tags. Below is a utility function to extract the model’s reply while excluding the content within these tags.

import re

def extract_model_reply(model_response):
    # Take the text of the first generation and drop everything up to and
    # including the closing </think> tag, leaving only the final answer.
    output = model_response.generations[0][0].text
    model_reply = re.sub(r'^.*?</think>\s*', '', output, flags=re.DOTALL)
    print(model_reply)
    return model_reply
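To see what the regular expression does, here is a quick illustration on a made-up string shaped like a DeepSeek R1 reply:

import re

# Illustration only: a made-up string shaped like a DeepSeek R1 completion.
sample = "<think>\nThe user is asking about copayments...\n</think>\n\nA copayment is a fixed fee per service."
print(re.sub(r'^.*?</think>\s*', '', sample, flags=re.DOTALL))
# -> A copayment is a fixed fee per service.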

Below is the prompt, parameterized with the RAG $context and the $question to be asked.

from string import Template
prompt_template = Template("""
You are a helpful AI assistant. Use ONLY the following context to answer the question. If unsure, say "I don't know". Keep answers under 2 sentences.

Context: $context

Question: $question

Answer:
""")

For a sample dataset containing relevant contexts and corresponding questions, construct a prompt and invoke the model to generate responses. Then, use the utility function to extract the actual model output.
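The loop below relies on a small construct_context helper whose implementation is not shown in this article. A minimal sketch, assuming each relevant context is a list of retrieved passage strings that simply need to be joined, could look like this; adapt it to whatever shape your retriever returns:

def construct_context(relevant_context):
    # Sketch only: join the retrieved passages into one block of text for the prompt.
    # Assumes relevant_context is a list of strings.
    return "\n\n".join(relevant_context)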

# accumulate the prompt response for each query
results = []
for relevant_context, question in zip(relevant_contexts, question_texts):
    context = construct_context(relevant_context)
    prompt_str = prompt_template.substitute(context=context, question=question)
    model_response = llm.generate([prompt_str])
    model_reply = extract_model_reply(model_response)
    results.append(model_reply)

Below is one such sample output from the DeepSeek R1 1.5B model.

generations=[[GenerationChunk(text='

<think>\nOkay, I need to figure out how to answer the question "What is the difference between copayment and coinsurance?"
based on the provided context.

Let me go through each term one by one.\n\n
First, a copayment refers to a fixed amount you pay for covered services, like $20 for a doctor visit.
It\'s a set amount that doesn\'t change regardless of how much your costs are beyond the deductible.\n\n
On the other hand, coinsurance is a percentage of costs you must pay before your insurance covers the rest.
For example, if the deductible is 10%, and you spend $150 for an office visit, you\'d pay $15 (10% of $150) in coinsurance.\n\n
So, the key difference is that copayment is a flat fee, while coinsurance adds on a percentage based on costs exceeding a certain threshold.
This makes me understand how each term works and helps prevent miscommunication about payment structures.\n</think>\n\n

The copayment is a fixed amount paid for covered services (e.g., $20), while the coinsurance is a percentage of costs paid before the deductible, such as 10% on an office visit costing $150.',

generation_info={'model': 'deepseek-r1:1.5b', 'created_at': '2025-01-30T13:03:52.62955432Z', 'response': '', 'done': True, 'done_reason': 'stop',
'context': [151644, ….,13], 'total_duration': 22509413057, 'load_duration': 2381354546, 'prompt_eval_count': 432, 'prompt_eval_duration': 7470000000,
'eval_count': 250, 'eval_duration': 12655000000})]] llm_output=None run=[RunInfo(run_id=UUID('f8795416-5415-47fe-b196-cdc7dde43cae'))]

Here is a Pandas DataFrame containing the question, DeepSeek R1 response, reference answer (e.g., from another LLM), and the RAG context.

[Figure: DeepSeek R1 Response DataFrame]
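If you are assembling this frame yourself, one simple way (a sketch, assuming a reference_answers list exists alongside the questions, contexts, and extracted replies; the column names match those used by the evaluation code below) is:

import pandas as pd

# Sketch: build per-question records. reference_answers is a hypothetical list,
# e.g. answers produced by another LLM for the same questions.
llm_data = [
    {
        "question": question,
        "contexts": context,
        "d3r1_result": reply,
        "reference_answer": reference,
    }
    for question, context, reply, reference in zip(
        question_texts, relevant_contexts, results, reference_answers
    )
]
df = pd.DataFrame(llm_data)
df.head()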

Evaluation of RAG metrics using IBM watsonx.governance

Using the combination of the question, the contexts, the DeepSeek R1-generated answer, and the reference answer, let's evaluate the following metrics:

  • Faithfulness: Measures how well the generated text aligns with the retrieved context. Did the model stay on topic based on what it found?
  • Answer Relevance: Captures how relevant the final response is to the original prompt.
  • Context Relevance: Captures how relevant the retrieved contexts are to the question asked.
  • Unsuccessful Requests: Counts the questions that the model could not answer.
  • Answer Similarity: Given a ground truth answer, evaluates its semantic similarity to the model-generated answer.
  • Content Analysis Metrics (a toy illustration of coverage and compression follows this list):
      • Coverage: Quantifies the extent to which the answer is derived from the context, measured as the percentage of answer words that appear in the source text.
      • Density: Assesses how closely the sequence of words in the generated answer reads as a series of direct extractions from the source text.
      • Compression: The ratio of the number of words in the source content to the number of words in the generated answer.
      • Novelty: Also called abstractness; measures how many new n-grams appear in the generated answer compared to the references.
      • Redundancy: Measures the percentage of n-grams in the answer that are (unnecessarily!) repeated.
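To build intuition for the first two content-analysis numbers, here is a toy, whitespace-token illustration of coverage and compression; the actual watsonx.governance implementation is more sophisticated, so treat this only as a mental model:

# Toy illustration only; not how watsonx.governance computes these internally.
source = "A copayment is a fixed amount you pay for a covered service such as a doctor visit"
answer = "A copayment is a fixed amount you pay for a covered service"

source_words = source.lower().split()
answer_words = answer.lower().split()

coverage = sum(w in source_words for w in answer_words) / len(answer_words)  # share of answer words found in the source
compression = len(source_words) / len(answer_words)                          # source length relative to answer length
print(f"coverage={coverage:.2f}, compression={compression:.2f}")

With that intuition in place, the next snippet shows the metric configuration passed to the SDK.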
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMTextMetricGroup, LLMCommonMetrics, LLMQAMetrics, LLMRAGMetrics, RetrievalQualityMetric

# Edit below values based on the input data
context_columns = ["contexts"]
question_column = "question"
answer_column = "d3r1_result"
reference_column = "reference_answer"

config_json = {
    "configuration": {
        "record_level": True,
        "context_columns": context_columns,
        "question_column": question_column,
        LLMTextMetricGroup.RAG.value: {
            LLMCommonMetrics.FAITHFULNESS.value: {},
            LLMCommonMetrics.ANSWER_RELEVANCE.value: {},
            LLMQAMetrics.UNSUCCESSFUL_REQUESTS.value: {},
            LLMCommonMetrics.CONTENT_ANALYSIS.value: {},
            LLMRAGMetrics.RETRIEVAL_QUALITY.value: {
                RetrievalQualityMetric.CONTEXT_RELEVANCE.value: {
                    "ngrams": 5
                },
                RetrievalQualityMetric.RETRIEVAL_PRECISION.value: {},
                RetrievalQualityMetric.AVERAGE_PRECISION.value: {},
                RetrievalQualityMetric.RECIPROCAL_RANK.value: {},
                RetrievalQualityMetric.HIT_RATE.value: {},
                RetrievalQualityMetric.NDCG.value: {}
            }
        }
    }
}

Call the watsonx.governance SDK to evaluate the metrics, passing the model input and output along with the reference answers.

import json
import pandas as pd

df_input = pd.DataFrame(llm_data, columns=context_columns + [question_column])
df_output = pd.DataFrame(llm_data, columns=[answer_column])
df_reference = pd.DataFrame(llm_data, columns=[reference_column])

metrics_result = client.llm_metrics.compute_metrics(config_json,
                                                    sources=df_input,
                                                    predictions=df_output,
                                                    references=df_reference,
                                                    # background_mode=False
                                                    )
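The exact shape of the returned object can vary by SDK version; assuming it comes back as a plain Python dict (which is why json is imported above), it can be pretty-printed for a quick look:

# Assumes metrics_result behaves like a plain dict; adjust for your SDK version.
print(json.dumps(metrics_result, indent=2, default=str))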

Here are the evaluation results.

[Figure: Evaluation Results]

As observed, faithfulness is strong for some context and answer combinations but falls short for others, even when human oversight is involved. Similarly, while most generated answers are relevant to their corresponding questions, some are not, despite expectations.

Notes

This article does not draw conclusions about the model’s overall performance, as it also depends on the dataset used. Instead, it focuses on demonstrating the mechanism for using IBM watsonx.governance to evaluate responses generated by the DeepSeek R1 model.

As most LLM application developers are still in the exploration phase of the DeepSeek R1 model, with many implementations being part of a Proof of Concept (PoC), this article focuses solely on evaluating metrics using the SDK approach.

Additionally, IBM watsonx.governance can be leveraged for continuous monitoring of prompts across any Large Language Model (LLM).

Written by Ravi Chamarthy

STSM, IBM watsonx.governance - Monitoring & IBM Master Inventor
