Evaluate function calling accuracy with exact tool match comparison.

This example uses the built-in `exact_tool_match_reward` function to assess whether models correctly call the right functions with the correct arguments in the expected format.
Each row in the dataset contains the following fields (an example row is sketched after the list):

- `messages`: Conversation history with user queries and assistant responses
- `tools`: Available function definitions with schemas
- `ground_truth`: Expected tool calls in JSON format
- `evaluation_result`: Pre-computed evaluation scores for validation
- `input_metadata`: Additional context including task type and difficulty
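For illustration, a single dataset row could be written as one JSONL line roughly as follows. The field names follow the structure above, but the concrete values, the shape of `ground_truth`, and the file name are made up for this sketch.

```python
import json

# Hypothetical row; values are illustrative only.
row = {
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    # Assumed shape: a JSON object listing the expected tool calls.
    "ground_truth": {
        "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}]
    },
    "evaluation_result": {"score": 1.0},
    "input_metadata": {"task_type": "weather_lookup", "difficulty": "easy"},
}

# Append the row as one line of the JSONL dataset file.
with open("function_calling_dataset.jsonl", "a") as f:
    f.write(json.dumps(row) + "\n")
```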
The test file uses the following imports (shown together in the sketch below):

- `json`: Python's JSON module for parsing ground truth data
- `typing`: Python's typing module for type hints (Any, Dict, List)
- `EvaluationRow`: The data structure containing conversation messages and metadata
- `default_single_turn_rollout_processor`: Default processor for single-turn conversations
- `evaluation_test`: Decorator for configuring evaluation tests
- `exact_tool_match_reward`: Built-in function calling evaluation function
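A sketch of what those imports might look like. The module path for `exact_tool_match_reward` comes from the section below; the other `eval_protocol` import paths are assumptions and may differ from the actual package layout.

```python
import json
from typing import Any, Dict, List

# Assumed import locations; check the eval_protocol package for the exact modules.
from eval_protocol import (
    EvaluationRow,
    default_single_turn_rollout_processor,
    evaluation_test,
)
from eval_protocol.rewards.function_calling import exact_tool_match_reward
```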
The `@evaluation_test` decorator configures the evaluation (a usage sketch follows the list):

- `input_dataset`: Path to the function calling dataset JSONL file
- `model`: The model to evaluate (a Fireworks Kimi model in this case)
- `mode`: `pointwise`, for evaluating individual rows, since each row can be evaluated independently
- `dataset_adapter`: Function that converts the function calling format to `EvaluationRow` objects
- `rollout_processor`: Uses the default single-turn processor for function calling evaluations
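Putting those options together, a decorated test could look roughly like the sketch below. The dataset path, model identifier, adapter function, the decorator's exact argument types, and the test-function signature and row attributes are assumptions for illustration, not the original example's code.

```python
# Assumes the imports shown above; names and paths are illustrative placeholders.
@evaluation_test(
    input_dataset="data/function_calling_dataset.jsonl",  # placeholder path
    model="accounts/fireworks/models/kimi-k2-instruct",   # placeholder Fireworks model id
    mode="pointwise",
    dataset_adapter=function_calling_dataset_to_evaluation_row,  # hypothetical adapter
    rollout_processor=default_single_turn_rollout_processor,
)
def test_function_calling(row: EvaluationRow) -> EvaluationRow:
    # Score the assistant's tool calls against the row's ground truth.
    row.evaluation_result = exact_tool_match_reward(
        messages=row.messages,
        ground_truth=row.ground_truth,
    )
    return row
```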
The `exact_tool_match_reward` function then analyzes the assistant's response in each row.

## exact_tool_match_reward Function

The `exact_tool_match_reward` function is a built-in evaluation function that performs exact matching between generated and expected tool calls. It's located in `eval_protocol.rewards.function_calling`.
Key Features:
- `messages`: List of conversation messages (extracts tool calls from the last assistant message)
- `ground_truth`: Expected tool calls dictionary for comparison
- `**kwargs`: Additional parameters (not used in this implementation)
- Returns an `EvaluateResult` with a score (1.0 for exact match, 0.0 for any mismatch) and detailed reasoning
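As a rough illustration of that interface (the message and ground-truth shapes below, and the `reason` attribute name, are assumptions rather than the library's documented format):

```python
# Assumes the imports shown earlier.
# Hypothetical conversation in which the assistant makes one tool call.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"city": "Paris"}',
                },
            }
        ],
    },
]

# Expected tool calls for this row (assumed shape).
ground_truth = {
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}]
}

result = exact_tool_match_reward(messages=messages, ground_truth=ground_truth)
print(result.score)   # 1.0 for an exact match, 0.0 for any mismatch
print(result.reason)  # detailed reasoning (attribute name is an assumption)
```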
## eval_tool_call Function

Internally, the evaluation relies on the `eval_tool_call` function, which handles the detailed comparison of tool calls.
Function Signature:
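The original signature isn't reproduced here; a plausible sketch, assuming the function takes the assistant message and the expected tool calls and returns an `EvaluateResult`:

```python
from typing import Any, Dict

from eval_protocol import EvaluateResult  # assumed import path


def eval_tool_call(message: Dict[str, Any], expected: Dict[str, Any]) -> EvaluateResult:
    """Compare the tool calls in an assistant message against the expected calls.

    Assumed signature for illustration; see eval_protocol.rewards.function_calling
    for the actual definition.
    """
    ...
```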
It handles tool calls provided either in the structured `tool_calls` array of the assistant message or inside `<tool_call>...</tool_call>` tags in the message content.
## compare_tool_calls Function

The comparison itself is performed by the `compare_tool_calls` function, which ensures exact matching.
Function Signature:
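The original signature isn't shown here either; the following is only a minimal sketch of what exact matching over two lists of tool calls could look like, with the parameter names and argument handling as assumptions:

```python
import json
from typing import Any, Dict, List


def compare_tool_calls(
    generated: List[Dict[str, Any]], expected: List[Dict[str, Any]]
) -> bool:
    """Return True only when the generated tool calls exactly match the expected ones.

    Assumed behavior for illustration: same number of calls, same function names,
    and arguments that are equal once JSON-encoded strings are parsed.
    """
    if len(generated) != len(expected):
        return False
    for gen, exp in zip(generated, expected):
        if gen.get("name") != exp.get("name"):
            return False
        gen_args, exp_args = gen.get("arguments"), exp.get("arguments")
        # Parse JSON strings so formatting differences don't cause false mismatches.
        if isinstance(gen_args, str):
            gen_args = json.loads(gen_args)
        if isinstance(exp_args, str):
            exp_args = json.loads(exp_args)
        if gen_args != exp_args:
            return False
    return True
```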
The `exact_tool_match_reward` function provides a strict but comprehensive evaluation that ensures models can: