Article • 6 min read
Building realistic multi‑turn tests for AI agents
Inside Zendesk's new framework for measuring reliability in real-world, multi-turn conversations.
Mariana Almeida
Global Lead Scientist, Research and Applications at Zendesk
Last updated: November 4, 2025
At Zendesk, we invest heavily in making sure our AI agents are not only smart, but also reliable in real‑world support scenarios. Our recent paper “Automated test generation to evaluate tool‑augmented LLMs as conversational AI agents” by Arcadinho et al. describes a methodology that directly supports that goal. Here’s a walkthrough of the key approach, and why it matters if you’re building agents that call APIs, follow procedures, and handle varied user behavior.
The central idea is simple: accuracy on a single task is not enough. What matters is whether agents can stay consistent across an entire conversation.
Why multi-turn testing matters
Support agents powered by large language models are no longer simple FAQ bots. They act on user requests—cancelling orders, updating accounts, diagnosing service issues—and must manage context across turns. Traditional benchmarks only validate single-turn tool calls, missing the complexity agents face in dialogs with clarifications, interruptions, and digressions.
Zendesk AI agents resolve a vast majority of customer and employee conversations autonomously, but ensuring they remain consistent across multi‑turn flows is critical. The paper’s automated evaluation pipeline is built specifically to catch failures in those scenarios. The results show that while models handle individual tool calls reliably, they often fail once a conversation involves multiple turns, clarifications, or interruptions.
Zendesk-inspired evaluation pipeline
Here’s how the paper’s pipeline echoes the engineering rigor we expect at Zendesk (a simplified code sketch of the flow follows the list):
- Intent and procedure generation: A language model generates user intents (e.g. “reset my password”, “cancel order”) and detailed agent procedures to fulfill them.
- API/tool extraction: The procedure defines the exact tool or API calls required at each step—essential for verifying agent behavior.
- Conversation graph construction: Procedures map to directed flowgraphs that simulate paths an agent might follow—normal execution, decision branches, dead-ends, detours. This grounds conversations in the defined procedures and reduces hallucinations, while also ensuring that all possible paths are tested.
- Noise injection: Realism is added by injecting interruptions, clarifying questions, or context shifts—testing agent resilience under interruption.
- Path sampling and dialogue synthesis: A weighted random walk samples multiple trajectories. Each path is turned into a full dialogue with the LLM synthesizing user and agent turns.
- Test extraction and validation: Conversations are split into individual test cases with expected API outcomes. Automated validation and manual review ensure quality.
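To make the shape of this pipeline concrete, here is a minimal Python sketch of the graph-based stages (flowgraph, weighted random walk, noise injection, expected-call extraction). It is an illustration under assumed names such as `Step`, `FlowGraph`, `sample_path`, and `inject_noise`, not the paper’s or Zendesk’s actual implementation.

```python
# Illustrative sketch only: names and structures are hypothetical,
# not the paper's or Zendesk's actual code.
import random
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str                        # e.g. "verify_identity"
    api_call: str | None = None      # expected tool/API call at this step, if any
    next_steps: list[str] = field(default_factory=list)

@dataclass
class FlowGraph:
    steps: dict[str, Step]
    start: str

def sample_path(graph: FlowGraph, weights: dict[str, float] | None = None,
                max_len: int = 12) -> list[Step]:
    """Weighted random walk over the procedure flowgraph."""
    weights = weights or {}
    path, current = [], graph.steps[graph.start]
    while len(path) < max_len:
        path.append(current)
        if not current.next_steps:   # reached a terminal step
            break
        nxt = random.choices(
            current.next_steps,
            weights=[weights.get(s, 1.0) for s in current.next_steps],
        )[0]
        current = graph.steps[nxt]
    return path

def inject_noise(path: list[Step], noise_rate: float = 0.2) -> list[Step]:
    """Insert digression placeholders to stress context management."""
    noisy: list[Step] = []
    for step in path:
        noisy.append(step)
        if random.random() < noise_rate:
            noisy.append(Step(name="user_digression"))
    return noisy

def expected_api_calls(path: list[Step]) -> list[str]:
    """Ground-truth tool sequence a correct agent should produce for this path."""
    return [s.api_call for s in path if s.api_call]

# Tiny usage example with a made-up "cancel order" procedure.
graph = FlowGraph(
    start="greet",
    steps={
        "greet": Step("greet", next_steps=["lookup_order"]),
        "lookup_order": Step("lookup_order", api_call="get_order", next_steps=["cancel"]),
        "cancel": Step("cancel", api_call="cancel_order"),
    },
)
print(expected_api_calls(inject_noise(sample_path(graph))))  # ['get_order', 'cancel_order']
```

In the full pipeline, each sampled (and noised) path would then be handed to an LLM to synthesize the user and agent turns, which is what turns a graph walk into a realistic test conversation.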
This technical pipeline resembles a system we’d build into Zendesk’s agent testing framework: grounded, procedure‑based, and robust under divergence. It helps expose edge cases and context‑management failures that simple accuracy tests miss.
The Zendesk ALMA (Agentic Language Model Assessment) Benchmark: Practically grounded and curated
The team produced Zendesk ALMA, a benchmark Zendesk has open sourced, comprising 1,420 conversations manually curated from a larger automatically generated dataset. Each conversation includes the following (a sketch of what one test case might look like appears after the list):
- Full agent/user dialogue.
- Expected tool/API call sequence.
- Clear evaluation criteria for success/failure.
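For intuition, a single test case might look roughly like the example below. The field names are hypothetical and do not necessarily match the released dataset’s exact schema.

```python
# Hypothetical shape of one Zendesk ALMA test case; field names are
# illustrative, not the released dataset's actual schema.
example_test_case = {
    "conversation": [
        {"role": "user", "content": "I need to cancel order #1234."},
        {"role": "agent", "content": "Sure, let me check that order for you."},
    ],
    "expected_api_calls": [
        {"name": "get_order", "arguments": {"order_id": "1234"}},
        {"name": "cancel_order", "arguments": {"order_id": "1234"}},
    ],
    # Success requires producing the full expected call sequence with correct arguments.
    "evaluation": {"criterion": "exact_tool_sequence_match"},
}
```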
This mirrors our thinking at Zendesk: datasets need procedural transparency and curated realism to be meaningful for agent evaluation. Zendesk ALMA is a leading benchmark that covers both tool use and full conversations in support settings, making it a reference point for evaluating real agent performance.
Performance of modern agents
In testing GPT‑4–based agents and others using Zendesk ALMA, the results follow a consistent pattern:
- Single-turn tool invocation: agents perform well.
In controlled, single-step tasks, modern LLMs perform reliably. On Zendesk ALMA, models such as GPT-4o, Claude 3 Sonnet, and Mistral-NeMo all achieved over 90% accuracy in identifying when an API call was required and selecting the correct one. Their parameter accuracy was similarly strong, in the 85–95% range. These results confirm that today’s agents can handle straightforward, isolated tool use with high precision.
- Multi-turn, noisy dialogs: agents stumble—missing context, making incorrect calls, or failing to complete tasks.
The picture changes when conversations involve multiple steps, clarifications, or interruptions. In those cases, accuracy drops sharply. Conversation correctness fell to 14.1% for GPT-4o, 10.4% for Claude 3 Sonnet, 7.3% for Mistral-NeMo, and just 4.2% for GPT-4. Failures included losing track of context, issuing unnecessary API calls, or abandoning tasks before completion. This gap highlights how brittle current models are once dialog complexity approaches real-world support scenarios.
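A back-of-the-envelope calculation (not a result from the paper) helps explain the gap: even if every turn were handled independently with roughly 90% accuracy, a long conversation would still fail most of the time, and a strict conversation-level metric only counts fully correct tool sequences.

```python
# Illustrative only: independence is a simplifying assumption, not how the
# paper computes its metrics.
def conversation_success(per_turn_accuracy: float, turns: int) -> float:
    """Probability a whole conversation succeeds if each turn is independent."""
    return per_turn_accuracy ** turns

def conversation_correct(predicted_calls: list, expected_calls: list) -> bool:
    """Strict conversation-level check: every expected call, in order."""
    return predicted_calls == expected_calls

print(round(conversation_success(0.90, 10), 2))  # 0.35
print(round(conversation_success(0.90, 20), 2))  # 0.12
```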
That pattern aligns with our internal QA: agents can automate common workflows reliably, but struggle when conversation flow deviates from ideal scenarios—notably on interruptions or context jumps.
Engineering implications for Zendesk AI Agents
For those building within Zendesk or similar platforms, the implication is clear: treat the procedures your agents already follow as the source of truth for testing, and generate multi-turn test conversations from them rather than relying on one-off, single-turn checks.

This approach ensures that as you update workflows or procedures in Zendesk, your tests update in lock-step, avoiding drift and preventing pipeline mismatches. A minimal sketch of what that regeneration step might look like follows.
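The sketch below is one hypothetical way to keep tests in lock-step with procedures: hash each procedure definition and regenerate its derived tests whenever the hash changes. The file layout and the `generate_tests_for` helper are assumptions for illustration; in practice that helper would call the graph-construction and sampling pipeline sketched earlier.

```python
# Hypothetical "lock-step" regeneration sketch; file layout and helper names
# are assumptions, not Zendesk tooling.
import hashlib
import json
from pathlib import Path

def generate_tests_for(procedure: dict) -> list[dict]:
    # Placeholder: stands in for flowgraph construction, path sampling,
    # noise injection, and dialogue synthesis.
    return [{"procedure": procedure.get("name"), "expected_api_calls": []}]

def regenerate_if_changed(procedures_dir: str = "procedures",
                          tests_dir: str = "generated_tests") -> None:
    out_dir = Path(tests_dir)
    out_dir.mkdir(exist_ok=True)
    for proc_file in Path(procedures_dir).glob("*.json"):
        text = proc_file.read_text()
        digest = hashlib.sha256(text.encode()).hexdigest()
        stamp = out_dir / f"{proc_file.stem}.sha256"
        if stamp.exists() and stamp.read_text() == digest:
            continue  # procedure unchanged, keep existing tests
        tests = generate_tests_for(json.loads(text))
        (out_dir / f"{proc_file.stem}_tests.json").write_text(json.dumps(tests, indent=2))
        stamp.write_text(digest)
```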
Future directions and integration ideas
Beyond customer support: While this work centers on support use cases, the same pipeline works for internal employee service, HR onboarding, compliance workflows, and more.
Live evaluation during deployment: Embed test injection in staging pipelines and run conversation flows automatically as part of CI/CD to validate agent updates before they ship (see the sketch after this list).
Deeper skill validation: Extend testing into memory retention, tool chaining, policy-based escalation, and adversarial resilience—areas where agents still struggle.
Metrics beyond success/failure: Capture coherence drift, tone consistency, API timing, and policy compliance—much like the visibility controls in Zendesk’s reasoning pipeline agent builder.
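As one possible shape for the CI/CD idea above, here is a hypothetical pytest-style check that replays generated test conversations against a staging deployment. `staging_agent` is an assumed fixture wrapping whatever API your staging agent exposes; it is not part of any real Zendesk SDK.

```python
# Hypothetical staging gate: replay generated test conversations and require
# the expected tool sequence before an agent update ships.
import json
from pathlib import Path

import pytest

TEST_CASES = [json.loads(p.read_text()) for p in Path("generated_tests").glob("*.json")]

@pytest.mark.parametrize("case", TEST_CASES)
def test_agent_reproduces_expected_tool_sequence(case, staging_agent):
    # `staging_agent` is an assumed fixture wrapping your staging deployment.
    transcript = staging_agent.run_conversation(case["conversation"])
    assert transcript.tool_calls == case["expected_api_calls"]
```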
TL;DR: Four lessons that should guide your strategy
- Do not confuse single-turn success with conversational reliability. Agents that score above 90% on individual tool calls still complete only roughly 4–14% of full conversations correctly on Zendesk ALMA. In production, it is the multi-turn flow that matters.
- Ground your testing in real procedures. Synthetic benchmarks are not enough. The most reliable agents are tested against the same structured workflows your teams actually run—refunds, escalations, account updates—so evaluation evolves as your business does.
- Resilience is the true bar for readiness. The difference between a lab demo and a live deployment is how agents behave under pressure—interruptions, clarifications, policy boundaries, even adversarial inputs. That is where trust is won or lost.
- Use benchmarks to guide—not just grade—decisions. Zendesk ALMA shows where today’s models are strong and where they break. Leaders should use those insights to decide when to automate, when to hold back, and where to design seamless human-in-the-loop handoffs.
At Zendesk, where AI agents autonomously resolve complex issues, aligning testing with real human-agent conversational flow is essential. The methodology outlined here provides a practical template to do just that: grounded in procedures, resilient to interruptions, and able to surface where agents break down. All of this ensures agents stay reliable even as user behavior varies.
For any business building with AI, the takeaway is clear: test at the level you plan to deploy, and measure reliability across the full dialog, not just the endpoint.
