This example demonstrates how to create comprehensive function calling evaluations using the Eval Protocol (EP) framework. The evaluation uses the built-in exact_tool_match_reward function to assess whether models call the right functions, with the correct arguments, in the expected format.
You can find the complete code for this example at test_pytest_function_calling.py.

Understanding Function Calling Evaluation

Function calling evaluation assesses a model’s ability to:
  • Identify when to use tools: Determine if a user query requires function execution
  • Select the correct function: Choose the appropriate tool from available options
  • Provide accurate arguments: Pass the right parameters with correct values
  • Follow proper formatting: Use the expected tool call structure
Unlike text-based evaluations that focus on content generation, function calling evaluations test a model’s tool selection and parameterization capabilities, which are critical skills for AI agents that interact with external systems.

Understanding the Dataset Structure

The function calling dataset contains diverse test cases that evaluate different aspects of tool usage, from simple weather queries to complex nested object creation.

Dataset Format

Each entry in the dataset contains:
  • messages: Conversation history with user queries and assistant responses
  • tools: Available function definitions with schemas
  • ground_truth: Expected tool calls, stored as a JSON string
  • evaluation_result: Pre-computed evaluation scores for validation
  • input_metadata: Additional context including task type and difficulty
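For orientation, here is a minimal sketch (not part of the example code) of loading and inspecting such a JSONL file. The path matches the dataset used later in this example; the asserted field names are the ones listed above.
import json

# Load the function calling dataset: one JSON object per line.
with open("tests/pytest/data/function_calling.jsonl") as f:
    entries = [json.loads(line) for line in f if line.strip()]

# Inspect the first entry and confirm the core fields are present.
first = entries[0]
for field in ("messages", "tools", "ground_truth", "input_metadata"):
    assert field in first, f"missing field: {field}"
print(first["input_metadata"]["row_id"])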

Example Dataset Entries

Perfect Match - Weather Query:
{
  "messages": [
    {"role": "user", "content": "What's the weather in London?"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"
          }
        }
      ]
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather information for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "The city name"},
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "Temperature unit"
            }
          },
          "required": ["location", "unit"]
        }
      }
    }
  ],
  "ground_truth": "{\"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": \"{\\\"location\\\": \\\"London\\\", \\\"unit\\\": \\\"celsius\\\"}\"}}]}",
  "input_metadata": {
    "row_id": "weather_london_perfect",
    "dataset_info": {"task_type": "function_calling", "difficulty": "easy"}
  }
}
Argument Mismatch - Wrong Unit:
{
  "messages": [
    {"role": "user", "content": "What's the weather in London?"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"London\", \"unit\": \"fahrenheit\"}"
          }
        }
      ]
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather information for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "The city name"},
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "Temperature unit"
            }
          },
          "required": ["location", "unit"]
        }
      }
    }
  ],
  "ground_truth": "{\"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": \"{\\\"location\\\": \\\"London\\\", \\\"unit\\\": \\\"celsius\\\"}\"}}]}",
  "input_metadata": {
    "row_id": "weather_london_unit_mismatch",
    "dataset_info": {"task_type": "function_calling", "difficulty": "easy"}
  }
}
Function Name Mismatch:
{
  "messages": [
    {"role": "user", "content": "What's the weather in London?"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "fetch_weather",
            "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"
          }
        }
      ]
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather information for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "The city name"},
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "Temperature unit"
            }
          },
          "required": ["location", "unit"]
        }
      }
    }
  ],
  "ground_truth": "{\"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": \"{\\\"location\\\": \\\"London\\\", \\\"unit\\\": \\\"celsius\\\"}\"}}]}",
  "input_metadata": {
    "row_id": "weather_london_name_mismatch",
    "dataset_info": {"task_type": "function_calling", "difficulty": "easy"}
  }
}
No Tool Call Expected:
{
  "messages": [
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Why did the chicken cross the road?"}
  ],
  "tools": [],
  "ground_truth": "{\"tool_calls\": []}",
  "input_metadata": {
    "row_id": "joke_no_calls",
    "dataset_info": {"task_type": "function_calling", "difficulty": "easy"}
  }
}
Complex Nested Object Creation:
{
  "messages": [
    {"role": "user", "content": "Create a user for John Doe"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "create_user",
            "arguments": "{\"user\": {\"firstName\": \"John\", \"lastName\": \"Doe\", \"age\": 30}}"
          }
        }
      ]
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "create_user",
        "description": "Create a new user",
        "parameters": {
          "type": "object",
          "properties": {
            "user": {
              "type": "object",
              "properties": {
                "firstName": {"type": "string"},
                "lastName": {"type": "string"},
                "age": {"type": "number"}
              },
              "required": ["firstName", "lastName", "age"]
            }
          },
          "required": ["user"]
        }
      }
    }
  ],
  "ground_truth": "{\"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"create_user\", \"arguments\": \"{\\\"user\\\": {\\\"firstName\\\": \\\"John\\\", \\\"lastName\\\": \\\"Doe\\\", \\\"age\\\": 30}}\"}}]}",
  "input_metadata": {
    "row_id": "create_user_nested",
    "dataset_info": {"task_type": "function_calling", "difficulty": "hard"}
  }
}

Dataset Characteristics

Test Scenarios: The dataset covers various function calling challenges:
  • Perfect matches: Correct function name and arguments
  • Argument mismatches: Wrong parameter values (e.g., wrong temperature unit)
  • Function name errors: Calling non-existent or wrong functions
  • Extra calls: Making unnecessary tool calls
  • Missing calls: Failing to call required functions
  • No-call scenarios: Queries that don’t require function execution
  • Complex objects: Nested parameter structures
  • Invalid JSON: Malformed argument strings
Tool Types: Various function categories:
  • Weather services: Location-based queries with units
  • User management: CRUD operations with complex objects
  • Data retrieval: Search and find operations
  • Utility functions: Simple parameterized operations
Difficulty Levels: Progressive complexity:
  • Easy: Simple single-parameter calls
  • Medium: Multi-parameter calls with validation
  • Hard: Nested object structures and complex schemas

Step 1: Import Required Dependencies

First, we import the necessary modules from the EP framework:
import json
from typing import Any, Dict, List
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
from eval_protocol.rewards.function_calling import exact_tool_match_reward
  • json: Python’s JSON module for parsing ground truth data
  • typing: Python’s typing module for type hints (Any, Dict, List)
  • EvaluationRow: The data structure containing conversation messages and metadata
  • SingleTurnRolloutProcessor: Default rollout processor for single-turn conversations
  • evaluation_test: Decorator for configuring evaluation tests
  • exact_tool_match_reward: Built-in function calling evaluation function

Step 2: Create the Dataset Adapter

We need to convert the function calling dataset format to the EP’s expected format:
def function_calling_to_evaluation_row(rows: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """
    Convert function calling dataset entries to EvaluationRow objects.
    
    This adapter extracts the conversation messages, available tools, and ground truth
    from the function calling dataset format and creates EvaluationRow objects that
    the EP framework can process.
    
    Args:
        rows: List of function calling dataset entries
        
    Returns:
        List of EvaluationRow objects ready for evaluation
    """
    dataset: List[EvaluationRow] = []
    for row in rows:
        dataset.append(
            EvaluationRow(
                messages=row["messages"][:1],  # Only the user message
                tools=row["tools"],            # Available function definitions
                ground_truth=row["ground_truth"]  # Expected tool calls
            )
        )
    return dataset
This adapter:
  • Takes the raw function calling dataset as a list of dictionaries
  • Extracts the user message (first message in the conversation)
  • Includes the available tools/function definitions
  • Sets the ground truth to the expected tool calls
  • Returns the list of evaluation rows
Key transformations:
  • Message extraction: Uses only the user message since the assistant’s response will be generated during evaluation
  • Tool preservation: Maintains the function schemas for context
  • Ground truth: Preserves the expected tool calls for comparison
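As an illustrative sanity check (not part of the test file, and assuming the adapter defined in this step is in scope), the adapter can be applied directly to a single in-memory row; the entry below is a trimmed copy of the London weather example shown earlier.
# Apply the adapter to one trimmed dataset entry.
sample_rows = [
    {
        "messages": [{"role": "user", "content": "What's the weather in London?"}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get weather information for a location",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {"type": "string"},
                            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                        },
                        "required": ["location", "unit"],
                    },
                },
            }
        ],
        "ground_truth": '{"tool_calls": [{"type": "function", "function": {"name": "get_weather", "arguments": "{\\"location\\": \\"London\\", \\"unit\\": \\"celsius\\"}"}}]}',
    }
]

rows = function_calling_to_evaluation_row(sample_rows)
print(len(rows))             # 1
print(rows[0].ground_truth)  # still the raw JSON string of expected tool calls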

Step 3: Configure and Run the Evaluation

We use the @evaluation_test decorator to configure the evaluation:
@evaluation_test(
    input_dataset=["tests/pytest/data/function_calling.jsonl"],
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct"}],
    mode="pointwise",
    dataset_adapter=function_calling_to_evaluation_row,
    rollout_processor=SingleTurnRolloutProcessor(),
)
async def test_pytest_function_calling(row: EvaluationRow) -> EvaluationRow:
    """Run pointwise evaluation on sample dataset using pytest interface."""
    ground_truth = json.loads(row.ground_truth)
    result = exact_tool_match_reward(row.messages, ground_truth)
    row.evaluation_result = result
    print(result)
    return row
Configuration parameters:
  • input_dataset: Path to the function calling dataset JSONL file
  • completion_params: Completion settings, including the model to evaluate (a Fireworks Kimi model in this case)
  • mode: pointwise, because each row can be evaluated independently
  • dataset_adapter: Function that converts the function calling format to EvaluationRow objects
  • rollout_processor: SingleTurnRolloutProcessor, which generates a single assistant turn per row
Evaluation process:
  1. Parse ground truth: Convert the JSON string to a dictionary for comparison
  2. Extract tool calls: The exact_tool_match_reward function analyzes the assistant’s response
  3. Compare exactly: Check if function names, arguments, and order match perfectly
  4. Return results: Provide binary score (1.0 for perfect match, 0.0 for any mismatch)
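Outside the pytest harness, the same scoring step can be exercised directly on a pre-annotated dataset entry (one that already contains an assistant message), which is a convenient way to sanity-check ground truth before running a full rollout. This is a standalone sketch, not part of the test above.
import json

from eval_protocol.rewards.function_calling import exact_tool_match_reward

# Read one pre-annotated entry and score its stored assistant response.
with open("tests/pytest/data/function_calling.jsonl") as f:
    entry = json.loads(f.readline())

result = exact_tool_match_reward(entry["messages"], json.loads(entry["ground_truth"]))
print(result.score, result.reason)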

Core Functions Explained

exact_tool_match_reward Function

The exact_tool_match_reward function is a built-in evaluation function that performs exact matching between generated and expected tool calls. It’s located in eval_protocol.rewards.function_calling.
Key Features:
  • Exact matching: Requires perfect alignment of function names, arguments, and order
  • Multiple formats: Handles both structured tool calls and XML-formatted calls
  • JSON parsing: Automatically deserializes and normalizes tool call arguments
  • Robust comparison: Uses sorted JSON serialization for consistent comparison
  • Error handling: Gracefully handles malformed inputs and edge cases
Function Signature:
def exact_tool_match_reward(
    messages: Union[List[Message], List[Dict[str, Any]]],
    ground_truth: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> EvaluateResult:
Parameters:
  • messages: List of conversation messages (extracts tool calls from the last assistant message)
  • ground_truth: Expected tool calls dictionary for comparison
  • **kwargs: Additional parameters (not used in this implementation)
Return Value:
  • EvaluateResult with score (1.0 for exact match, 0.0 for any mismatch) and detailed reasoning
Example Usage:
result = exact_tool_match_reward(
    messages=messages,
    ground_truth={
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"location": "London", "unit": "celsius"}'
                }
            }
        ]
    }
)
print(f"Score: {result.score}")  # 1.0 if exact match, 0.0 otherwise
print(f"Reason: {result.reason}")  # Detailed explanation of the evaluation

eval_tool_call Function

The core evaluation logic is implemented in the eval_tool_call function, which handles the detailed comparison of tool calls.
Function Signature:
def eval_tool_call(generation: dict, ground_truth: dict) -> bool:
Implementation Details:
  1. Extract expected calls: Parse ground truth tool calls from the expected format
  2. Process generated calls: Handle both structured tool calls and XML-formatted calls
  3. Normalize formats: Convert all calls to a consistent internal format
  4. Compare exactly: Use JSON serialization with sorted keys for deterministic comparison
Supported Formats:
  • Structured tool calls: Standard OpenAI format with tool_calls array
  • XML-formatted calls: <tool_call>...</tool_call> tags in content
  • Mixed formats: Combinations of different call types
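The snippet below is a simplified, illustrative take on this normalization step for the structured format (XML extraction is sketched under Advanced Features); it is not the library’s actual implementation.
import json

def normalize_tool_calls(message: dict) -> list:
    """Normalize structured tool calls from an assistant message dict (illustrative)."""
    normalized = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        try:
            # Deserialize the argument string so key order and whitespace inside it don't matter.
            args = json.loads(fn.get("arguments", "{}"))
        except json.JSONDecodeError:
            args = None  # malformed arguments can never match the ground truth
        normalized.append({"name": fn.get("name"), "arguments": args})
    return normalized
Once generated and expected calls are in a normalized shape like this, the exact comparison described next reduces to string equality on their sorted JSON serializations.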

compare_tool_calls Function

The final comparison is performed by the compare_tool_calls function, which ensures exact matching.
Function Signature:
def compare_tool_calls(generated_tool_calls: list, gt_tool_calls: list) -> bool:
Comparison Logic:
  1. Length check: Number of tool calls must match exactly
  2. JSON serialization: Convert each tool call to sorted JSON string
  3. Exact matching: Compare serialized strings for perfect equality
  4. Order matters: Tool calls must be in the same sequence
Example Comparison:
# Generated calls
generated = [
    {"name": "get_weather", "arguments": '{"location": "London", "unit": "celsius"}'}
]

# Expected calls
expected = [
    {"name": "get_weather", "arguments": '{"location": "London", "unit": "celsius"}'}
]

# Result: True (exact match)
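A minimal sketch of that comparison, assuming both lists have already been normalized, could look like this (illustrative, not the library’s exact code):
import json

def compare_normalized_calls(generated: list, expected: list) -> bool:
    """Exact, order-sensitive comparison of normalized tool calls (illustrative)."""
    if len(generated) != len(expected):  # extra or missing calls fail immediately
        return False

    def key(call: dict) -> str:
        return json.dumps(call, sort_keys=True)  # deterministic serialization

    # zip pairs calls positionally, so reordering otherwise-identical calls also fails.
    return all(key(g) == key(e) for g, e in zip(generated, expected))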

Evaluation Scenarios and Results

The function calling evaluation handles various scenarios with different outcomes:

Perfect Match (Score: 1.0)

Scenario: Model calls the exact function with correct arguments
{
  "generated": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"},
  "expected": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"}
}
Result: ✅ Perfect match - all function names, arguments, and order are correct

Argument Mismatch (Score: 0.0)

Scenario: Model calls correct function but with wrong arguments
{
  "generated": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"fahrenheit\"}"},
  "expected": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"}
}
Result: ❌ Argument mismatch - wrong temperature unit specified

Function Name Error (Score: 0.0)

Scenario: Model calls wrong function name
{
  "generated": {"name": "fetch_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"},
  "expected": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"}
}
Result: ❌ Function name error - called non-existent function

Extra Tool Call (Score: 0.0)

Scenario: Model makes unnecessary additional calls
{
  "generated": [
    {"name": "get_weather", "arguments": "{\"location\": \"London\"}"},
    {"name": "extra_call", "arguments": "{}"}
  ],
  "expected": [
    {"name": "get_weather", "arguments": "{\"location\": \"London\"}"}
  ]
}
Result: ❌ Extra tool call - made unnecessary additional function call

Missing Tool Call (Score: 0.0)

Scenario: Model fails to call required function
{
  "generated": [],
  "expected": [
    {"name": "get_weather", "arguments": "{\"location\": \"London\"}"}
  ]
}
Result: ❌ Missing tool call - failed to call required function

No Call Expected (Score: 1.0)

Scenario: Query doesn’t require function execution
{
  "generated": [],
  "expected": []
}
Result: ✅ No call expected - correctly avoided unnecessary function calls

Advanced Features

XML-Formatted Tool Calls

The evaluation supports XML-formatted tool calls embedded in content:
# Assistant response with XML formatting
content = '<tool_call>{"type": "function", "function": {"name": "get_weather", "arguments": "{\\"location\\": \\"Berlin\\", \\"unit\\": \\"celsius\\"}"}}</tool_call>'

# The evaluation automatically parses and compares these calls
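One way such embedded calls could be extracted is sketched below; this illustrates the idea rather than the framework’s internal parser.
import json
import re

def extract_xml_tool_calls(content: str) -> list:
    """Pull <tool_call>...</tool_call> JSON payloads out of message content (illustrative)."""
    calls = []
    for payload in re.findall(r"<tool_call>(.*?)</tool_call>", content, re.DOTALL):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # malformed payloads are skipped and will not match the ground truth
    return calls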

Complex Nested Objects

The evaluation handles complex parameter structures:
# Nested user object creation
{
  "name": "create_user",
  "arguments": '{"user": {"firstName": "John", "lastName": "Doe", "age": 30}}'
}

Multiple Tool Calls

The evaluation supports scenarios with multiple sequential tool calls:
# Multiple weather queries
[
  {"name": "get_weather", "arguments": '{"location": "London"}'},
  {"name": "get_weather", "arguments": '{"location": "Paris"}'}
]

Best Practices for Function Calling Evaluation

Dataset Design

  • Diverse scenarios: Include various failure modes and edge cases
  • Progressive difficulty: Start with simple calls and progress to complex objects
  • Real-world examples: Use realistic function schemas and use cases
  • Clear ground truth: Ensure expected tool calls are unambiguous

Evaluation Configuration

  • Appropriate models: Use models with strong function calling capabilities
  • Consistent parameters: Use deterministic settings (temperature=0.0) for reproducible results, as shown in the sketch after this list
  • Adequate context: Provide clear function descriptions and examples
  • Error handling: Gracefully handle parsing errors and edge cases
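For instance, a deterministic run might pass a temperature alongside the model in completion_params. Whether extra keys are forwarded to the provider depends on your EP version and endpoint, so treat this as a sketch rather than guaranteed configuration (it reuses the imports and adapter from the example above).
@evaluation_test(
    input_dataset=["tests/pytest/data/function_calling.jsonl"],
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0}],
    mode="pointwise",
    dataset_adapter=function_calling_to_evaluation_row,
    rollout_processor=SingleTurnRolloutProcessor(),
)
async def test_function_calling_deterministic(row: EvaluationRow) -> EvaluationRow:
    """Same evaluation as above, with deterministic sampling requested."""
    row.evaluation_result = exact_tool_match_reward(row.messages, json.loads(row.ground_truth))
    return row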

Result Interpretation

  • Binary scoring: Understand that this is a strict exact-match evaluation
  • Detailed analysis: Use the reasoning field to understand specific failures
  • Pattern recognition: Look for systematic errors in function selection or argument formatting
  • Model comparison: Compare different models’ function calling accuracy

Conclusion

This function calling evaluation example demonstrates how to create robust assessments of AI models’ tool usage capabilities. The exact_tool_match_reward function provides a strict but comprehensive evaluation that checks whether models can:
  1. Identify when tools are needed: Distinguish between queries requiring function calls and those that don’t
  2. Select appropriate functions: Choose the correct tool from available options
  3. Provide accurate parameters: Pass the right arguments with correct values
  4. Follow proper formatting: Use the expected tool call structure consistently
This evaluation is particularly valuable for:
  • Agent development: Ensuring AI agents can reliably interact with external systems
  • API integration: Validating models’ ability to use structured APIs correctly
  • Tool selection: Testing models’ understanding of when and how to use different tools
  • Parameter accuracy: Verifying that models provide correct input values
The function calling evaluation complements other evaluation types by focusing on execution accuracy rather than content generation, making it essential for building reliable AI systems that can interact with external tools and APIs.