This example uses the built-in exact_tool_match_reward function to assess whether models correctly call the right functions with the correct arguments in the expected format.
You can find the complete code for this example at test_pytest_function_calling.py.
Understanding Function Calling Evaluation
Function calling evaluation assesses a model’s ability to:
- Identify when to use tools: Determine if a user query requires function execution
- Select the correct function: Choose the appropriate tool from available options
- Provide accurate arguments: Pass the right parameters with correct values
- Follow proper formatting: Use the expected tool call structure
Understanding the Dataset Structure
The function calling dataset contains diverse test cases that evaluate different aspects of tool usage, from simple weather queries to complex nested object creation.
Dataset Format
Each entry in the dataset contains:
- messages: Conversation history with user queries and assistant responses
- tools: Available function definitions with schemas
- ground_truth: Expected tool calls in JSON format
- evaluation_result: Pre-computed evaluation scores for validation
- input_metadata: Additional context including task type and difficulty
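Concretely, a single entry might look like the following sketch, built here as a Python dict for readability. The field names follow the list above; the tool schema and all concrete values are illustrative:

```python
import json

# One illustrative dataset entry (values are hypothetical):
entry = {
    "messages": [
        {"role": "user", "content": "What's the weather in Paris in celsius?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location"],
                },
            },
        }
    ],
    # Ground truth is stored as a JSON string of the expected tool calls.
    "ground_truth": json.dumps(
        {
            "tool_calls": [
                {
                    "name": "get_weather",
                    "arguments": {"location": "Paris", "unit": "celsius"},
                }
            ]
        }
    ),
    "input_metadata": {"task_type": "weather_query", "difficulty": "easy"},
}
```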
Example Dataset Entries
Perfect Match - Weather Query
Dataset Characteristics
Test Scenarios: The dataset covers various function calling challenges:
- Perfect matches: Correct function name and arguments
- Argument mismatches: Wrong parameter values (e.g., wrong temperature unit)
- Function name errors: Calling non-existent or wrong functions
- Extra calls: Making unnecessary tool calls
- Missing calls: Failing to call required functions
- No-call scenarios: Queries that don’t require function execution
- Complex objects: Nested parameter structures
- Invalid JSON: Malformed argument strings
The test cases span several function domains:
- Weather services: Location-based queries with units
- User management: CRUD operations with complex objects
- Data retrieval: Search and find operations
- Utility functions: Simple parameterized operations
Difficulty Levels:
- Easy: Simple single-parameter calls
- Medium: Multi-parameter calls with validation
- Hard: Nested object structures and complex schemas
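To make the “Hard” tier concrete, here is a hypothetical ground truth with a nested object argument; the create_user schema is invented for illustration:

```python
import json

# Hypothetical ground truth for a "hard" entry: a create_user call
# whose arguments contain a nested object (schema is illustrative).
hard_ground_truth = {
    "tool_calls": [
        {
            "name": "create_user",
            "arguments": {
                "user": {
                    "name": "Ada Lovelace",
                    "email": "ada@example.com",
                    "preferences": {"newsletter": True, "theme": "dark"},
                }
            },
        }
    ]
}

# Sorted-key serialization normalizes key order at every nesting level,
# which is what keeps exact comparison deterministic for nested objects.
canonical = json.dumps(hard_ground_truth, sort_keys=True)
```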
Step 1: Import Required Dependencies
First, we import the necessary modules from the EP framework:
- json: Python’s JSON module for parsing ground truth data
- typing: Python’s typing module for type hints (Any, Dict, List)
- EvaluationRow: The data structure containing conversation messages and metadata
- default_single_turn_rollout_processor: Default processor for single-turn conversations
- evaluation_test: Decorator for configuring evaluation tests
- exact_tool_match_reward: Built-in function calling evaluation function
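As a sketch, the import block might look like this. The EP import paths are commented out because they should be verified against the installed eval_protocol package and the example file:

```python
import json
from typing import Any, Dict, List

# EP framework imports (exact module paths should be checked against
# test_pytest_function_calling.py in your eval_protocol version):
# from eval_protocol import EvaluationRow, evaluation_test
# from eval_protocol import default_single_turn_rollout_processor
# from eval_protocol.rewards.function_calling import exact_tool_match_reward
```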
Step 2: Create the Dataset Adapter
We need to convert the function calling dataset format to the EP’s expected format. The adapter:
- Takes the raw function calling dataset as a list of dictionaries
- Extracts the user message (first message in the conversation)
- Includes the available tools/function definitions
- Sets the ground truth to the expected tool calls
- Returns the list of evaluation rows
Key design decisions:
- Message extraction: Uses only the user message since the assistant’s response will be generated during evaluation
- Tool preservation: Maintains the function schemas for context
- Ground truth: Preserves the expected tool calls for comparison
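The steps above can be sketched as follows. Plain dicts stand in for the EvaluationRow objects the real adapter returns (its constructor is assumed from the EP docs), but the field mapping is the same:

```python
from typing import Any, Dict, List

def function_calling_dataset_adapter(raw_rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sketch of the dataset adapter using plain dicts.

    The real adapter builds EvaluationRow objects; this version only
    illustrates the field mapping described above.
    """
    evaluation_rows = []
    for row in raw_rows:
        evaluation_rows.append({
            # Keep only the user message; the assistant reply is
            # generated during evaluation.
            "messages": [row["messages"][0]],
            # Preserve the function schemas for context.
            "tools": row.get("tools", []),
            # Preserve the expected tool calls for comparison.
            "ground_truth": row["ground_truth"],
        })
    return evaluation_rows
```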
Step 3: Configure and Run the Evaluation
We use the @evaluation_test decorator to configure the evaluation:
- input_dataset: Path to the function calling dataset JSONL file
- model: The model to evaluate (a Fireworks Kimi model in this case)
- mode: pointwise, for evaluating individual rows, since each row can be evaluated independently
- dataset_adapter: Function that converts the function calling format to EvaluationRow objects
- rollout_processor: Uses the default single-turn processor for function calling evaluations
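Putting the configuration together might look like the following fragment. It is not runnable on its own: the decorator arguments mirror the parameter list above, while the dataset path, model id, and the reward-call signature are placeholders to be checked against test_pytest_function_calling.py:

```python
# Configuration sketch (argument names taken from the parameter list
# above; values are illustrative placeholders):
@evaluation_test(
    input_dataset=["path/to/function_calling_dataset.jsonl"],  # placeholder path
    model=["accounts/fireworks/models/kimi-k2-instruct"],      # example model id
    mode="pointwise",
    dataset_adapter=function_calling_dataset_adapter,
    rollout_processor=default_single_turn_rollout_processor,
)
def test_function_calling(row: EvaluationRow) -> EvaluationRow:
    # Score the generated assistant response against the expected calls;
    # ground truth is stored as a JSON string, so parse it first.
    row.evaluation_result = exact_tool_match_reward(
        row.messages, json.loads(row.ground_truth)
    )
    return row
```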
The evaluation then proceeds in four steps:
- Parse ground truth: Convert the JSON string to a dictionary for comparison
- Extract tool calls: The exact_tool_match_reward function analyzes the assistant’s response
- Compare exactly: Check if function names, arguments, and order match perfectly
- Return results: Provide a binary score (1.0 for a perfect match, 0.0 for any mismatch)
Core Functions Explained
exact_tool_match_reward Function
The exact_tool_match_reward function is a built-in evaluation function that performs exact matching between generated and expected tool calls. It’s located in eval_protocol.rewards.function_calling.
Key Features:
- Exact matching: Requires perfect alignment of function names, arguments, and order
- Multiple formats: Handles both structured tool calls and XML-formatted calls
- JSON parsing: Automatically deserializes and normalizes tool call arguments
- Robust comparison: Uses sorted JSON serialization for consistent comparison
- Error handling: Gracefully handles malformed inputs and edge cases
Parameters:
- messages: List of conversation messages (tool calls are extracted from the last assistant message)
- ground_truth: Expected tool calls dictionary for comparison
- **kwargs: Additional parameters (not used in this implementation)
Returns: An EvaluateResult with a score (1.0 for an exact match, 0.0 for any mismatch) and detailed reasoning
eval_tool_call Function
The core evaluation logic is implemented in the eval_tool_call function, which handles the detailed comparison of tool calls.
Function Signature:
Evaluation steps:
- Extract expected calls: Parse ground truth tool calls from the expected format
- Process generated calls: Handle both structured tool calls and XML-formatted calls
- Normalize formats: Convert all calls to a consistent internal format
- Compare exactly: Use JSON serialization with sorted keys for deterministic comparison
Supported formats:
- Structured tool calls: Standard OpenAI format with a tool_calls array
- XML-formatted calls: <tool_call>...</tool_call> tags in content
- Mixed formats: Combinations of different call types
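The XML-formatted case can be illustrated with a small, self-contained extraction sketch. It is regex-based and assumes each tag wraps a JSON object with name and arguments keys; that payload shape is an assumption for illustration, not the library’s code:

```python
import json
import re
from typing import Any, Dict, List

def extract_xml_tool_calls(content: str) -> List[Dict[str, Any]]:
    """Pull <tool_call>...</tool_call> payloads out of assistant text.

    Illustrative sketch: assumes each tag wraps a JSON object with
    "name" and "arguments" keys (a common chat-model convention).
    """
    # Non-greedy match so adjacent tags are captured separately.
    payloads = re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", content, re.DOTALL)
    return [json.loads(p) for p in payloads]
```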
compare_tool_calls Function
The final comparison is performed by the compare_tool_calls function, which ensures exact matching.
Function Signature:
Comparison steps:
- Length check: Number of tool calls must match exactly
- JSON serialization: Convert each tool call to sorted JSON string
- Exact matching: Compare serialized strings for perfect equality
- Order matters: Tool calls must be in the same sequence
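The checks above can be sketched as a standalone re-implementation (illustrative only, not the library source). Because order is preserved in the pairwise comparison, it also covers scenarios with multiple sequential tool calls:

```python
import json
from typing import Any, Dict, List

def compare_tool_calls_sketch(generated: List[Dict[str, Any]],
                              expected: List[Dict[str, Any]]) -> bool:
    """Exact-match comparison following the steps above.

    Each tool call is assumed to be a dict like
    {"name": ..., "arguments": {...}}.
    """
    # 1. Length check: call counts must match exactly.
    if len(generated) != len(expected):
        return False
    # 2-4. Serialize each call with sorted keys and compare pairwise,
    # so key order never matters but call order always does.
    for gen, exp in zip(generated, expected):
        if json.dumps(gen, sort_keys=True) != json.dumps(exp, sort_keys=True):
            return False
    return True
```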
Evaluation Scenarios and Results
The function calling evaluation handles various scenarios with different outcomes:

Perfect Match (Score: 1.0)
Scenario: Model calls the exact function with correct arguments

Argument Mismatch (Score: 0.0)
Scenario: Model calls the correct function but with wrong arguments

Function Name Error (Score: 0.0)
Scenario: Model calls the wrong function name

Extra Tool Call (Score: 0.0)
Scenario: Model makes unnecessary additional calls

Missing Tool Call (Score: 0.0)
Scenario: Model fails to call a required function

No Call Expected (Score: 1.0)
Scenario: Query doesn’t require function execution

Advanced Features
XML-Formatted Tool Calls
The evaluation supports XML-formatted tool calls embedded in message content.
Complex Nested Objects
The evaluation handles complex, nested parameter structures.
Multiple Tool Calls
The evaluation supports scenarios with multiple sequential tool calls.
Best Practices for Function Calling Evaluation
Dataset Design
- Diverse scenarios: Include various failure modes and edge cases
- Progressive difficulty: Start with simple calls and progress to complex objects
- Real-world examples: Use realistic function schemas and use cases
- Clear ground truth: Ensure expected tool calls are unambiguous
Evaluation Configuration
- Appropriate models: Use models with strong function calling capabilities
- Consistent parameters: Use deterministic settings (temperature=0.0) for reproducible results
- Adequate context: Provide clear function descriptions and examples
- Error handling: Gracefully handle parsing errors and edge cases
Result Interpretation
- Binary scoring: Understand that this is a strict exact-match evaluation
- Detailed analysis: Use the reasoning field to understand specific failures
- Pattern recognition: Look for systematic errors in function selection or argument formatting
- Model comparison: Compare different models’ function calling accuracy
Conclusion
This function calling evaluation example demonstrates how to create robust assessments of AI models’ tool usage capabilities. The exact_tool_match_reward function provides a strict but comprehensive evaluation that ensures models can:
- Identify when tools are needed: Distinguish between queries requiring function calls and those that don’t
- Select appropriate functions: Choose the correct tool from available options
- Provide accurate parameters: Pass the right arguments with correct values
- Follow proper formatting: Use the expected tool call structure consistently
This style of evaluation is useful in contexts such as:
- Agent development: Ensuring AI agents can reliably interact with external systems
- API integration: Validating models’ ability to use structured APIs correctly
- Tool selection: Testing models’ understanding of when and how to use different tools
- Parameter accuracy: Verifying that models provide correct input values

