This example demonstrates how to create comprehensive hallucination detection evaluations using the Eval Protocol (EP) framework. The evaluation uses an LLM-as-judge approach to assess whether AI model responses contain factual inaccuracies by comparing them against provided ground truth knowledge.
You can find the complete code for this example at test_hallucination.py.

Understanding Hallucination Detection Evaluation

Hallucination detection evaluation assesses whether AI models provide factually accurate responses that align with verified knowledge, rather than generating plausible-sounding but incorrect information. Unlike traditional accuracy metrics that focus on exact matches, this evaluation tests factual consistency and truthfulness - critical for building trustworthy AI systems.

The HaluEval Dataset

This evaluation uses the HaluEval QA dataset, a comprehensive benchmark containing 10,000 question-answering samples specifically designed to test hallucination detection. The dataset is built on HotpotQA with Wikipedia knowledge and includes both correct answers and ChatGPT-generated plausible hallucinations.

Dataset Structure

Each entry contains:
  • knowledge: Wikipedia context providing factual background information
  • question: Multi-hop reasoning question from HotpotQA requiring knowledge synthesis
  • right_answer: Verified ground-truth answer from HotpotQA
  • hallucinated_answer: ChatGPT-generated plausible but factually incorrect response

Example Entry

{
  "knowledge": "Her self-titled debut studio album was released on 2 June 2017.\"New Rules\" is a song by English singer Dua Lipa from her eponymous debut studio album (2017).",
  "question": "Dua Lipa, an English singer, songwriter and model, the album spawned the number-one single \"New Rules\" is a song by English singer Dua Lipa from her eponymous debut studio album, released in what year?",
  "right_answer": "2017",
  "hallucinated_answer": "The album was released in 2018."
}
Sample Dataset: The EP python-sdk includes a sample of 3 representative rows from the HaluEval QA dataset for testing and demonstration purposes. The full HaluEval QA dataset contains 10,000 knowledge-question pairs with both correct and hallucinated answers, designed to test models’ ability to distinguish factual accuracy from plausible misinformation.
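
To inspect these rows yourself, the sketch below reads the JSONL file line by line; it assumes the file sits at the path used later in the @evaluation_test decorator, so adjust the path to match your checkout:
import json

# Path used later in the @evaluation_test decorator; adjust to your local layout
dataset_path = "tests/pytest/data/halueval_sample_dataset.jsonl"

with open(dataset_path) as f:
    entries = [json.loads(line) for line in f if line.strip()]

for entry in entries:
    # Each JSONL line carries the knowledge context, the question, and both answers
    print(entry["question"])
    print("  right_answer:", entry["right_answer"])
    print("  hallucinated_answer:", entry["hallucinated_answer"])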

Step 1: Import Required Dependencies

First, we import the necessary modules from the EP framework and set up the LLM judge:
import json
from typing import Any, Dict, List

from fireworks import LLM
from eval_protocol.models import EvaluateResult, EvaluationRow, Message, MetricResult
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test

# Initialize the LLM judge for evaluation
judge_llm = LLM(model="accounts/fireworks/models/kimi-k2-instruct", deployment_type="serverless")
  • json: For parsing LLM judge responses and handling structured data
  • typing: Python’s typing module for type hints
  • fireworks.LLM: The LLM client for creating the judge model
  • EvaluateResult, EvaluationRow, Message, MetricResult: Core EP data structures
  • SingleTurnRolloutProcessor: Default processor for single-turn conversations
  • evaluation_test: Decorator for configuring evaluation tests
  • judge_llm: Pre-configured LLM instance that serves as the factual accuracy judge
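If you want to verify the judge client is reachable before running the full evaluation, a quick smoke test is shown below; this is a hypothetical check that only reuses the call pattern from the evaluation function later in this example and requires your Fireworks credentials to be configured:
# Hypothetical smoke test for the judge client (requires Fireworks credentials)
response = judge_llm.chat.completions.create(
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    max_tokens=5,
)
print(response.choices[0].message.content)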

Step 2: Create the Dataset Adapter

We need to convert the hallucination dataset format to the EP’s expected format:
def hallucination_dataset_adapter(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """
    Convert HaluEval dataset to EvaluationRow objects.
    
    This adapter combines the knowledge context with the question to create
    a complete user message, and stores the correct answer as ground truth
    for the LLM judge to use during evaluation.
    
    Args:
        data: List of hallucination dataset entries with knowledge, question, and right_answer
        
    Returns:
        List of EvaluationRow objects ready for evaluation
    """
    return [
        EvaluationRow(
            messages=[Message(role="user", content=f"Knowledge: {item['knowledge']}\n\nQuestion: {item['question']}")],
            ground_truth=item["right_answer"]
        )
        for item in data
    ]
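To sanity-check the adapter, you can run it on a single entry such as the example shown earlier; this is an illustrative snippet, not part of the test file, and the strings are truncated for brevity:
# Hypothetical quick check of the adapter on one dataset entry
sample = {
    "knowledge": "Her self-titled debut studio album was released on 2 June 2017. ...",
    "question": "... released in what year?",
    "right_answer": "2017",
    "hallucinated_answer": "The album was released in 2018.",
}

rows = hallucination_dataset_adapter([sample])
print(rows[0].messages[0].content)  # "Knowledge: ...\n\nQuestion: ... released in what year?"
print(rows[0].ground_truth)         # "2017"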

Step 3: Configure and Run the Evaluation

We use the @evaluation_test decorator to configure the hallucination detection evaluation:
@evaluation_test(
    input_dataset=["tests/pytest/data/halueval_sample_dataset.jsonl"],
    dataset_adapter=hallucination_dataset_adapter,
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0, "max_tokens": 512}],
    rollout_processor=SingleTurnRolloutProcessor(),
    passed_threshold=1.0,
    num_runs=1,
    mode="pointwise",
)
def test_hallucination_detection(row: EvaluationRow) -> EvaluationRow:
    """
    Test for response correctness using LLM-as-judge.
    
    This function:
    1. Extracts the assistant's response to the knowledge-question pair
    2. Uses an LLM judge to compare the response against the correct answer
    3. Returns a binary score based on factual accuracy
    
    Args:
        row: EvaluationRow containing the conversation messages and correct answer in ground_truth
        
    Returns:
        EvaluationRow with the evaluation result
    """
    messages = row.messages
    assistant_response = messages[-1].content

    if not assistant_response:
        row.evaluation_result = EvaluateResult(score=0.0, reason="❌ No assistant response found")
        return row

    correct_answer = row.ground_truth
    
    # LLM judge system prompt for factual accuracy assessment
    system_prompt = """
    TASK
    - You will be given an assistant's response and the correct answer.
    - Your job is to evaluate whether the assistant's response is factually consistent with the correct answer.
    - Grade whether the assistant got it right or wrong.

    FORMAT
    - Your response should be a JSON object with the following fields:
    - `reasoning`: a short explanation for your classification
    - `is_correct`: `true` if the assistant's response matches the correct answer, `false` otherwise

    Example response structure:
    {
        "reasoning": "<reasoning trace>",
        "is_correct": <true or false>
    }
    """

    user_prompt = f"""
    assistant_response:
    {assistant_response}

    correct_answer:
    {correct_answer}
    """

    try:
        # Query the LLM judge for factual accuracy assessment
        response = judge_llm.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.1,
            max_tokens=500,
        )
        
        result_data = json.loads(response.choices[0].message.content)
        is_correct = result_data.get("is_correct", False)
        reasoning = result_data.get("reasoning", "Could not parse reasoning")
        
    except Exception as e:
        # Fallback if LLM judge fails
        is_correct = False
        reasoning = f"Evaluation failed: {str(e)}"
    
    score = 1.0 if is_correct else 0.0
    
    if is_correct:
        assessment = "✅ Response is correct"
    else:
        assessment = "❌ Response is incorrect"
    
    reason = f"{assessment}\nReasoning: {reasoning}"

    row.evaluation_result = EvaluateResult(
        score=score,
        reason=reason,
        metrics={
            "llm_judge": MetricResult(
                score=score,
                reason=reasoning,
                is_score_valid=True
            )
        }
    )
    
    return row
Configuration parameters:
  • input_dataset: Path to the HaluEval sample dataset JSONL file
  • completion_params: The model under evaluation and its generation settings, with a moderate token limit for concise responses
  • passed_threshold: 1.0, meaning every response must be judged factually correct (hallucinations should be completely avoided)
  • num_runs: Number of evaluation passes over the dataset (a single run here)
  • mode: pointwise for evaluating individual knowledge-question pairs
  • dataset_adapter: Function that converts HaluEval format to EvaluationRow objects
  • rollout_processor: SingleTurnRolloutProcessor, the standard processor for single-turn conversations
Evaluation process:
  1. Response extraction: Get the assistant’s answer to the knowledge-question pair
  2. Judge preparation: Set up LLM judge with clear evaluation criteria
  3. Factual comparison: Use judge to compare assistant response against correct answer
  4. Structured evaluation: Judge provides reasoning and binary correctness assessment
  5. Score assignment: Convert judge decision to numerical score (1.0 or 0.0)
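
Because the evaluation is defined as a pytest test, it runs with the standard pytest command once the dataset path resolves, for example pytest test_hallucination.py -v (adjust the path to wherever the file lives in your project).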

Core Functions Explained

LLM-as-Judge System

The hallucination detection uses an LLM judge to assess factual accuracy.
Judge System Prompt Design:
  • Clear task definition: Explicitly states the factual consistency evaluation goal
  • Structured output: Requires JSON format with reasoning and binary decision
  • Objective criteria: Focuses on factual accuracy rather than style or completeness
  • Consistent format: Standardizes judge responses for reliable parsing
Judge Evaluation Process:
# The judge receives both responses for direct comparison
system_prompt = """
TASK
- You will be given an assistant's response and the correct answer.
- Your job is to evaluate whether the assistant's response is factually consistent with the correct answer.
- Grade whether the assistant got it right or wrong.

FORMAT
- Your response should be a JSON object with the following fields:
- `reasoning`: a short explanation for your classification  
- `is_correct`: `true` if the assistant's response matches the correct answer, `false` otherwise
"""
Advantages of LLM-as-Judge:
  • Semantic understanding: Can recognize factually equivalent statements with different wording
  • Context awareness: Understands nuanced relationships between concepts
  • Flexible matching: Handles partial answers and different levels of detail appropriately
  • Reasoning transparency: Provides explanations for evaluation decisions

Evaluation Scenarios and Results

The hallucination detection evaluation handles various factual accuracy scenarios:

Factually Correct Response (Score: 1.0)

Scenario: Model provides accurate information consistent with the knowledge
# Knowledge: "The speed of light in vacuum is approximately 299,792,458 meters per second..."
# Question: "What is the speed of light in vacuum?"
# Model response: "The speed of light in vacuum is approximately 299,792,458 m/s."
# Correct answer: "The speed of light in vacuum is approximately 299,792,458 meters per second."

# Judge reasoning: "The assistant's response is factually accurate. While it uses 'm/s' instead of 'meters per second', both represent the same unit and the numerical value is correct."
# Result: ✅ Response is correct

Factual Inaccuracy (Score: 0.0)

Scenario: Model provides incorrect information
# Knowledge: "The Berlin Wall was constructed in 1961..."
# Question: "When was the Berlin Wall built?"
# Model response: "The Berlin Wall was built in 1959."
# Correct answer: "The Berlin Wall was built in 1961."

# Judge reasoning: "The assistant provided an incorrect date. The Berlin Wall was built in 1961, not 1959."
# Result: ❌ Response is incorrect

Conclusion

This hallucination detection evaluation demonstrates how to assess AI models’ factual accuracy using LLM-as-judge methodology. The evaluation ensures models can provide truthful, accurate responses based on provided knowledge without introducing false information. This evaluation is particularly valuable for:
  • Factual accuracy assessment: Testing models’ ability to stay grounded in provided knowledge
  • Trustworthiness validation: Ensuring AI systems provide reliable, accurate information
  • Knowledge-based applications: Validating models for use in educational or informational contexts
The hallucination detection evaluation focuses on factual consistency and truthfulness rather than stylistic preferences, making it essential for building reliable AI systems that users can trust for accurate information. Because the LLM judge explains each verdict, the results are transparent, easy to audit, and applicable across diverse knowledge domains.