Rollout Status: Evals is currently being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see this feature in your account yet, reach out to your account manager to discuss access.
The Evals section is your command center for testing and evaluating AI Agent performance. Located in the Monitor tab (next to the Run tab) in the Agent builder, Evals lets you create Test Suites, define evaluation criteria (Evaluators), run automated evaluations, and monitor ongoing performance — all without manual testing.

[Image: Evals section showing Test Suites, Evaluators, Runs, and Performance]

What you can do with Evals

Conduct Tests

Create Test Suites with scenarios that simulate real user interactions. Combine scenarios with Evaluators to measure accuracy and evaluate Agent performance automatically.

Create Evaluators

Define evaluation criteria that automatically assess Agent responses. Evaluators look for specific conditions and score conversations based on your defined rules.

Monitor Performance

Automatically evaluate live Agent conversations using global Evaluators. Track scores, view insights, and monitor quality over time without running manual tests.

Evals sections

The Evals section is organized into five areas, accessible from the left sidebar:
  • Test Suites — Create and manage groups of Test scenarios for your Agent. Each Test Suite can contain multiple scenarios with different prompts and evaluation criteria.
  • Evaluators — Configure global evaluation criteria that can be applied across any Test Suite or scenario without needing to set them up each time.
  • Runs — View your evaluation run history and results. See average scores, number of conversations evaluated, progress status, credit spend, and creation dates for all past runs.
  • Publish Checks — Configure which Test Suites must pass before your Agent can be published. Set a pass threshold and optionally block publishing if evaluations fail.
  • Performance — Automatically evaluate live Agent conversations by selecting a global Evaluator, setting a sample rate, and filtering by conversation status.

Understanding Evaluators

Evaluators are evaluation criteria that automatically assess Agent conversations. There are two types of Evaluators:

Scenario Evaluators

Scenario Evaluators are created within individual Test scenarios. They evaluate the specific conversation generated by that scenario’s prompt.
  • Created inside a Test scenario
  • Only apply to the scenario they’re defined in
  • Scenario-specific evaluation criteria

Global Evaluators

Global Evaluators are configured in the Evaluators tab. They can be selected to run on any Test Suite or scenario without needing to configure them each time — think of them as reusable defaults.
  • Created in the Evaluators tab (separate from Test Suites)
  • Can be selected when running any Test Suite or individual scenario
  • Useful for standard criteria you want checked across scenarios, such as professional tone, no hallucinations, or brand voice compliance
  • Also used in the Performance tab to automatically evaluate live conversations

Evaluator types

When creating an Evaluator (either scenario-level or global), you choose from the following types:
LLM Judge

Uses an LLM to evaluate conversations against a prompt you define.
  • Evaluation Prompt — Describe the criteria for passing
  • Judge model — Select which model evaluates the conversation
  • Truncate long conversations — Toggle to truncate lengthy conversations before evaluation

String Contains

Checks whether the Agent’s response includes specific text.
  • Required text — The text that must appear in the response

String Equals

Checks whether the Agent’s response exactly matches an expected value.
  • Expected value — The exact message the Agent should have sent

Tool Usage

Checks whether a specific tool was used during the conversation.
  • Tool — Select the tool to check for
  • Position — Whether the tool was used anywhere, used first, or used last
  • Comparison — Check if the tool was used at least, exactly, or at most X times
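Conceptually, the three deterministic types reduce to simple checks, while LLM Judge sends the conversation and your Evaluation Prompt to the judge model. The following is an illustrative sketch of that logic, not the product's actual implementation; function names and the interplay between Position and Comparison are assumptions:

```python
def string_contains(response: str, required_text: str) -> bool:
    """String Contains: pass if the required text appears in the response."""
    return required_text in response

def string_equals(response: str, expected_value: str) -> bool:
    """String Equals: pass only on an exact match."""
    return response == expected_value

def tool_usage(tool_calls: list[str], tool: str, position: str = "anywhere",
               comparison: str = "at_least", count: int = 1) -> bool:
    """Tool Usage: check where and how often a tool was called.

    Hypothetical semantics: a 'first'/'last' position check is treated as
    overriding the count comparison.
    """
    if position == "first":
        return bool(tool_calls) and tool_calls[0] == tool
    if position == "last":
        return bool(tool_calls) and tool_calls[-1] == tool
    uses = tool_calls.count(tool)
    if comparison == "exactly":
        return uses == count
    if comparison == "at_most":
        return uses <= count
    return uses >= count  # default: "at_least"
```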
To create a global Evaluator:
  1. Go to the Monitor tab and select Evals, then select Evaluators
  2. Click + New Evaluator
  3. Select a Type (LLM Judge, String Contains, String Equals, or Tool Usage)
  4. Enter a Name for the Evaluator (e.g., “Professional Tone”)
  5. Configure the type-specific settings (see table above)
  6. Click Create Evaluator
When you run a Test scenario, scenario-level Evaluators are always included automatically. You can also add or remove global Evaluators (from the Evaluators tab) before each run, allowing you to mix standard criteria with scenario-specific evaluation rules.

Creating a Test Suite with a scenario

Follow these steps to create your first evaluation Test Suite:
  1. Open your Agent in the builder and click the Monitor tab (next to the Run tab). Select Evals from the left sidebar, then select Test Suites.
  2. Click the + New test suite button. Enter a name for your Test Suite and click Create.
  3. Click on the Test Suite you just created to open it.
  4. Click the + New Test button to create a scenario within your Test Suite.
  5. Fill in the scenario details:
    • Scenario name — A descriptive name for this Test case (e.g., “Response Empathy”)
    • Persona & situation — The persona or situation the simulated user will adopt (e.g., “You are an impatient customer who wants quick answers about their bill.”)
    • First message — A fixed message the simulated user sends to your Agent as the opening message (optional; e.g., “Hi, I need help with my bill.”)
    • Max turns — Maximum conversation turns (1-50), e.g., 10
    • Number of runs — How many times this scenario should be executed, e.g., 3
  6. Add Evaluators to define how this specific scenario should be evaluated:
    • Type — The Evaluator type (e.g., LLM Judge)
    • Name — Name of the evaluation criterion (e.g., “Empathy Shown”)
    • Type-specific config — Settings based on the chosen type (see Evaluator types), e.g., Evaluation Prompt: “Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions?”
    Click Create Evaluator to save it. You can then create additional Evaluators to add more evaluation criteria to the scenario.
  7. (Optional) Add Tool simulations to emulate tool usage without actually calling the tools. Tool simulations are configured at the scenario level:
    • Select a tool to simulate
    • Provide a prompt describing what the tool should return (a fake response is generated based on your prompt)
    • In the Advanced dropdown, you can select a Simulation model to control which model generates the simulated response
  8. Click Save Test scenario to save your configuration.
You can add multiple scenarios to a single Test Suite to evaluate different aspects of your Agent’s behavior. Each scenario can have its own prompt, max turns, number of runs, Evaluators, and tool simulations.

Example scenarios

Here are some example Test scenarios you might create:
Scenario name: Response Empathy
Persona & situation: You are a long-time customer who was recently charged twice for the same order. You’ve already contacted support once without resolution and are feeling frustrated but willing to give the Agent a chance to help. Express your concerns clearly and see if the Agent acknowledges your situation before jumping to solutions.
Max turns: 10
Evaluator: Empathy Shown (LLM Judge)
  • Evaluation Prompt: Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions? The response should show understanding of the emotional state and validate their concerns.

Scenario name: Product Expertise
Persona & situation: You are a procurement manager at a mid-sized company evaluating solutions for your team. You need specific details about enterprise pricing tiers, integration capabilities with existing tools like Salesforce and HubSpot, and data security certifications. Ask clarifying questions and compare features against competitors you’re also considering.
Max turns: 15
Evaluator: Accurate Information (LLM Judge)
  • Evaluation Prompt: Did the Agent provide accurate product information without making claims that cannot be verified? Responses should be factual, reference actual product capabilities, and acknowledge when information needs to be confirmed by a sales representative.

Scenario name: Escalation Request
Persona & situation: You are a paying customer who has experienced a service outage affecting your business operations. You’ve already tried troubleshooting with the knowledge base articles and need to speak with a senior support engineer or account manager. Be firm but professional in your request, and provide context about the business impact.
Max turns: 5
Evaluator: Appropriate Escalation (LLM Judge)
  • Evaluation Prompt: Did the Agent acknowledge the severity of the situation, validate the customer’s need for escalation, and initiate a handoff to a human representative while maintaining a professional and empathetic tone throughout?

Running evaluations

You can run an entire Test Suite or an individual Test scenario by clicking the Run button on either. Within a Test Suite, you can select specific scenarios to run together or run all scenarios at once. Note that you cannot bulk-select and run multiple Test Suites at the same time.
  1. Enter a name for the evaluation run (e.g., “Scenario Run - Jan 14, 12:14 PM”). A default name with timestamp is provided.
  2. Select which global Evaluators to include in the run — you can add or remove global Evaluators before starting. Scenario-level Evaluators are always included automatically.
  3. Click Run to begin. The system will simulate conversations with your Agent based on your scenario prompts and evaluate them with your selected Evaluators.

Understanding results

After running an evaluation, you’ll see a detailed results screen:

Run summary

The top of the results page shows key metrics:
  • Average Score — Overall pass rate across all scenarios and Evaluators
  • Number of Conversations — How many Test conversations were evaluated
  • Agent Version — The version of the Agent that was tested

Scenario results

Each scenario displays:
  • Status — Running, Completed, or Failed
  • Name — The scenario name
  • Score — Percentage of Evaluators that passed (shown with a progress bar)
  • Evaluators — Pass/fail count (e.g., “1/1 passed”)
  • Credits — Credits consumed for this scenario
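The Score column is simply the share of Evaluator verdicts that passed. A minimal sketch of that calculation (an assumption about the arithmetic, consistent with the pass/fail counts shown):

```python
def scenario_score(verdicts: list[bool]) -> float:
    """Percentage of Evaluators that passed, as shown in the Score column.

    `verdicts` holds one pass/fail result per Evaluator included in the run.
    """
    if not verdicts:
        return 0.0
    return 100 * sum(verdicts) / len(verdicts)
```

For example, a scenario where three of four Evaluators passed would display a score of 75%.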

Viewing conversation details

Click View Conversation on any scenario to see:
  1. The full conversation between the simulated user and your Agent
  2. Evaluator verdicts from all Evaluators included in the run, with detailed explanations of why each Evaluator passed or failed
For example, an “Empathy Shown” Evaluator might show:
Pass: The Agent demonstrated strong empathy throughout the conversation. Key examples include: acknowledging the customer’s frustration with being transferred multiple times (“I completely understand how upsetting it must be to feel like you’re not getting the help you need”), validating her experience with the double charge (“I truly understand how frustrating it is to be charged twice”), and directly addressing her skepticism by saying “I completely understand your concerns, especially given your previous experience.”

Performance tab

The Performance tab lets you automatically evaluate live Agent conversations without manually running Test Suites. This is useful for ongoing quality monitoring.

Setting up Performance monitoring

  1. Go to the Monitor tab, select Evals, then select Performance.
  2. Select a global Evaluator you’ve created in the Evaluators tab.
  3. Set a Sample rate — the percentage of conversations to evaluate.
  4. (Optional) Set a Conversation status filter to only evaluate conversations with specific statuses (e.g., completed, escalated). Leave blank to evaluate all conversations.
  5. Save your settings.
Once configured, conversations matching your criteria will be automatically evaluated at the sample rate you’ve set.

Viewing Performance insights

After the Performance evaluator has processed conversations, you can view:
  • Overall Score — Aggregate score across all evaluated conversations
  • Total Runs — Number of conversations evaluated
  • Evaluators — Which Evaluators are active
The Performance tab also includes:
  • Data points for the overall score over time
  • Evaluator breakdown showing individual scoring per Evaluator
  • Graphs visualizing Evaluator performance trends
  • List of evaluation runs with score, name, and the ability to view the full conversation
To adjust Performance settings after initial setup, click the Settings button in the top right corner of the Performance tab.

Publish Checks

Publish Checks let you choose which Test Suites to run before your Agent is published. If the results don’t meet your threshold, publishing can be blocked. You can configure Publish Checks from the Publish Checks section in Evals.

Test sets to run

Select which Test Suites to run before publishing. Click Add test sets to choose them — all scenarios in the selected Test Suites will be evaluated.

Publish settings

Configure how evaluations affect the publish process:
  • Pass threshold (%) — The minimum score percentage required for the evaluation to pass (e.g., 100%)
  • Block publish if evaluation fails — When checked, the Agent will only be published if the evaluation score meets or exceeds the pass threshold. If unchecked, the Agent is published even if the score falls below the threshold.
Once configured, click Save. When you next publish your Agent, the selected Test Suites will run automatically and the results will be checked against your threshold.
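The interaction between the pass threshold and the blocking toggle reduces to a simple gate. A sketch of that decision, under the assumption that the run's average score is compared against the threshold:

```python
def can_publish(average_score: float, pass_threshold: float,
                block_on_fail: bool) -> bool:
    """Publish gate: block only when blocking is enabled and the
    evaluation score falls below the pass threshold."""
    passed = average_score >= pass_threshold
    return passed or not block_on_fail
```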

Best practices

Start simple

Begin with a few core scenarios that test your Agent’s primary use cases. Add complexity as you learn what matters most.

Be specific with Evaluators

Write detailed evaluation rules. Vague criteria lead to inconsistent results. Include specific examples of what passing looks like.

Test edge cases

Create scenarios for difficult situations: angry customers, off-topic requests, requests to bypass rules, etc.

Use Performance monitoring

Set up global Evaluators in the Performance tab to continuously monitor live Agent conversations without manual testing.

Frequently asked questions (FAQs)

How many scenarios can I add to a Test Suite?

You can add as many scenarios as needed to a single Test Suite. Each scenario is evaluated independently and can have its own Evaluators.
How are credits calculated?

Credits consumed for each scenario are calculated by adding together:
  • The Agent task run (the conversation with your Agent)
  • The simulator (the persona/user simulation) — uses an LLM to simulate the user persona
  • The Evaluator evaluations (both scenario Evaluators and global Evaluators) — each Evaluator uses an LLM to evaluate the conversation
Each scenario shows its total credit usage in the results.
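The credit breakdown above amounts to a simple sum. The component costs below are hypothetical placeholders, only the structure of the calculation comes from the breakdown:

```python
def scenario_credits(agent_run: int, simulator: int,
                     evaluator_runs: list[int]) -> int:
    """Total credits for one scenario: Agent task run + user simulator
    + one LLM evaluation per Evaluator. All cost values are hypothetical."""
    return agent_run + simulator + sum(evaluator_runs)

# e.g., an Agent run, a simulator run, and two Evaluators
total = scenario_credits(agent_run=10, simulator=5, evaluator_runs=[2, 2])
```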
Can I re-run a Test scenario?

Yes, you can run the same Test scenarios again at any time. Each run is saved in your Runs history, allowing you to compare results across different Agent versions.
What is the difference between scenario Evaluators and global Evaluators?

Scenario Evaluators are created within Test scenarios and only evaluate conversations generated by that specific scenario. Global Evaluators are created in the Evaluators tab and can be selected to run on any Test Suite or scenario, providing reusable evaluation criteria across all your tests. Global Evaluators are also used in the Performance tab for monitoring live conversations.
Why don’t I see the Evals section in my account?

Evals is being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see the Evals section in the Monitor tab yet, reach out to your account manager to discuss access.