Detect factual inaccuracies by using an LLM-as-judge to compare model responses against ground-truth knowledge.
Each row in the HaluEval sample dataset contains:

- `knowledge`: Wikipedia context providing factual background information
- `question`: Multi-hop reasoning question from HotpotQA requiring knowledge synthesis
- `right_answer`: Verified ground-truth answer from HotpotQA
- `hallucinated_answer`: ChatGPT-generated plausible but factually incorrect response
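For concreteness, a single row might look like the following once parsed from the JSONL file (the values here are illustrative, not copied from the dataset):

```python
# One illustrative HaluEval-style row (values invented for illustration).
row = {
    "knowledge": "Arthur's Magazine (1844-1846) was an American literary "
                 "periodical. First for Women is a magazine launched in 1989.",
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "right_answer": "Arthur's Magazine",
    "hallucinated_answer": "First for Women was started first, in 1989.",
}
```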
The evaluation relies on the following imports:

- `json`: For parsing LLM judge responses and handling structured data
- `typing`: Python's typing module for type hints
- `fireworks.LLM`: The LLM client for creating the judge model
- `EvaluateResult`, `EvaluationRow`, `Message`, `MetricResult`: Core EP data structures
- `default_single_turn_rollout_processor`: Default processor for single-turn conversations
- `evaluation_test`: Decorator for configuring evaluation tests
- `judge_llm`: Pre-configured LLM instance that serves as the factual accuracy judge
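A sketch of what the import block might look like; the `eval_protocol` module paths and the judge model id are assumptions, so adjust them to your installed version:

```python
import json  # parse the judge's structured verdicts
from typing import List  # type hints for the dataset adapter

from fireworks import LLM  # client used to create the judge model

# Core EP data structures and pytest integration (module paths assumed).
from eval_protocol.models import (
    EvaluateResult,
    EvaluationRow,
    Message,
    MetricResult,
)
from eval_protocol.pytest import (
    default_single_turn_rollout_processor,
    evaluation_test,
)

# Pre-configured judge LLM; the model id below is illustrative.
judge_llm = LLM(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    deployment_type="serverless",
)
```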
Use the `@evaluation_test` decorator to configure the hallucination detection evaluation:

- `input_dataset`: Path to the HaluEval sample dataset JSONL file
- `model`: The model to evaluate for factual accuracy
- `rollout_input_params`: Model parameters with a moderate token limit for concise responses
- `threshold_of_success`: 100% accuracy threshold (hallucinations should be completely avoided)
- `mode`: `pointwise` for evaluating individual knowledge-question pairs
- `dataset_adapter`: Function that converts the HaluEval format to `EvaluationRow` objects
- `rollout_processor`: Uses the default single-turn processor
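Putting the configuration together, a minimal sketch might look like the following. The `halueval_to_evaluation_rows` adapter, the file path, the model ids, and the exact `EvaluationRow`/`EvaluateResult` field usage are assumptions for illustration, not the canonical implementation:

```python
def halueval_to_evaluation_rows(data: List[dict]) -> List[EvaluationRow]:
    """Hypothetical adapter: wrap each HaluEval row's question in a user message."""
    return [
        EvaluationRow(
            messages=[Message(role="user", content=item["question"])],
            ground_truth=item["right_answer"],
        )
        for item in data
    ]


@evaluation_test(
    input_dataset=["tests/data/halueval_sample.jsonl"],  # illustrative path
    model=["accounts/fireworks/models/llama-v3p1-8b-instruct"],  # model under test (assumed id)
    rollout_input_params=[{"max_tokens": 256}],  # moderate limit keeps answers concise
    threshold_of_success=1.0,  # 100%: hallucinations should be completely avoided
    mode="pointwise",  # score each knowledge-question pair individually
    dataset_adapter=halueval_to_evaluation_rows,
    rollout_processor=default_single_turn_rollout_processor,
)
def test_halueval_factual_accuracy(row: EvaluationRow) -> EvaluationRow:
    """Ask the judge whether the model's answer matches the ground truth."""
    answer = row.messages[-1].content  # assistant reply appended by the rollout
    prompt = (
        "Does the answer agree with the ground truth? "
        'Reply with JSON: {"factual": true} or {"factual": false}.\n'
        f"Ground truth: {row.ground_truth}\n"
        f"Answer: {answer}"
    )
    reply = judge_llm.chat.completions.create(
        messages=[{"role": "user", "content": prompt}]
    )
    verdict = json.loads(reply.choices[0].message.content)
    score = 1.0 if verdict.get("factual") else 0.0
    row.evaluation_result = EvaluateResult(
        score=score,
        reason="LLM judge factuality verdict",
        metrics={
            "factual_accuracy": MetricResult(
                score=score, reason="judge comparison", is_score_valid=True
            )
        },
    )
    return row
```

Because `mode` is `pointwise`, the decorated function receives one `EvaluationRow` at a time and returns it with `evaluation_result` populated; the harness then compares the aggregate score against `threshold_of_success`.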