In 2024, multi-agent systems were a research curiosity — interesting demos that fell apart on real workloads. By mid-2025, the picture changed. Tool-use APIs matured, agent frameworks (LangGraph, CrewAI, AutoGen) shipped production-grade orchestration, and the underlying models got dramatically better at following plans without going off the rails.
We have shipped multi-agent systems in production for BFSI (banking, financial services, and insurance) operations, customer service, internal IT, supply chain, and developer tooling. The teams that succeeded share a small set of patterns. The teams that failed mostly tried to apply agents where a workflow would have done the same job at a tenth of the cost.
This is what we have learned about when, where, and how to deploy multi-agent systems in real enterprises.
When a single bot is enough — and when it is not
The first decision is also the most important: should this even be an agent system?
A single LLM call is enough when:
- The task is one well-defined transformation (summarise this, classify that, extract these fields)
- The input fits in one prompt
- No external tools are needed mid-task
- The output is verified by something other than the model itself
A single agent with tools is enough when:
- The task needs 1–5 tool calls in a known sequence
- Decisions are simple ("if X is missing, look up Y")
- Failure modes are well-understood and recoverable
You actually need a multi-agent system when:
- The task naturally decomposes into specialised roles (analyst + writer + critic)
- Different parts of the task need different tools, permissions, or models
- Parallelism gives a real speedup (research multiple sources simultaneously)
- One agent's output is the prompt for another's reasoning
Most teams overshoot — they jump to multi-agent because it sounds more powerful. In practice, roughly 70% of "agentic" projects we audit could be solved with a single tool-using agent, at a third of the cost and with far simpler debugging.
The 5 patterns that work in production
These are the multi-agent patterns we keep reaching for — proven across dozens of deployments.
Pattern 1 — Planner + Executor
A planner agent decomposes the user request into ordered sub-tasks. An executor agent (or pool) carries out each sub-task. The planner reviews intermediate results and replans if needed.
Use when: the work is complex but the sub-tasks themselves are straightforward. Common in customer support automation, complex form processing, multi-step data lookups.
Pattern 2 — Supervisor with specialised workers
A supervisor agent routes incoming requests to specialised worker agents — each fine-tuned, prompted, or tool-equipped for a specific domain. Think of it as smart task routing.
Use when: your enterprise has clear domains of expertise (legal vs. finance vs. operations) and incoming requests can be classified. Common in internal IT helpdesk, multi-domain Q&A, omnichannel customer service.
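The routing core is small. In this sketch a keyword classifier stands in for an LLM router, and each specialist is a plain function standing in for an agent with its own prompt and tools:

```python
# Toy supervisor: classify the request, hand it to the matching specialist.
SPECIALISTS = {
    "legal": lambda q: f"[legal agent] reviewing: {q}",
    "finance": lambda q: f"[finance agent] analysing: {q}",
    "operations": lambda q: f"[operations agent] handling: {q}",
}

def classify(query: str) -> str:
    # Stub router; in production this is a cheap model call with a fixed label set.
    q = query.lower()
    if any(w in q for w in ("contract", "clause", "liability")):
        return "legal"
    if any(w in q for w in ("invoice", "budget", "cost")):
        return "finance"
    return "operations"

def supervise(query: str) -> str:
    return SPECIALISTS[classify(query)](query)

answer = supervise("is this contract clause enforceable?")
```

The key design constraint: the router returns one of a fixed set of labels, never free text, so a misclassification degrades gracefully instead of crashing the dispatch.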
Pattern 3 — Hierarchical teams
A team of agents, each itself a supervisor of a sub-team, modelled after how human organisations work. Layers handle scope (strategic → tactical → operational).
Use when: the work has true hierarchical structure — a research project that fans out to multiple investigations, each of which fans out to multiple data lookups. Often overkill but elegant when it fits.
Pattern 4 — Debate / critic
Two or more agents propose solutions and a critic agent evaluates them, returning the best — or asking for revisions. Drives quality up at the cost of latency and tokens.
Use when: high-stakes outputs where quality matters more than speed (legal drafts, regulatory filings, medical decision support, code reviews). Skip when latency or cost matters most.
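The debate loop reduces to propose-then-judge. Here the critic's `score()` is a stubbed heuristic (prefer the longer answer); a real critic is a separate model call with an evaluation rubric:

```python
def proposer(style: str):
    # Stub: each proposer stands in for an agent with a different prompt.
    return lambda prompt: f"{style} answer to: {prompt}"

def score(answer: str) -> int:
    # Stub critic heuristic; a real critic would be a model with a rubric.
    return len(answer)

def debate(prompt: str, proposers) -> str:
    candidates = [p(prompt) for p in proposers]
    return max(candidates, key=score)

best = debate("summarise clause 4",
              [proposer("concise"), proposer("thorough and detailed")])
```

Each extra proposer multiplies token cost roughly linearly, which is why this pattern belongs only on high-stakes outputs.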
Pattern 5 — Swarm / parallel research
A coordinator dispatches identical queries to a pool of agents, each researching from different angles or sources. Results are aggregated by the coordinator.
Use when: research-style tasks where you genuinely benefit from multiple perspectives (competitive analysis, due diligence, multi-source verification, broad-scope discovery).
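Since the researchers are independent, the fan-out is embarrassingly parallel. A sketch using a thread pool, with each researcher stubbed in place of an agent hitting a different source:

```python
from concurrent.futures import ThreadPoolExecutor

def researcher(source: str):
    # Stub: a real researcher agent would query this source and summarise.
    def run(query: str) -> dict:
        return {"source": source, "finding": f"{source} result for {query}"}
    return run

def swarm(query: str, sources: list[str]) -> list[dict]:
    agents = [researcher(s) for s in sources]
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        results = list(pool.map(lambda a: a(query), agents))
    return results  # the coordinator would dedupe and rank here

findings = swarm("vendor X solvency", ["news", "filings", "shipping"])
```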
For more on why most enterprise AI projects never reach production — including agentic ones — see our analysis of 73% AI failure rates.
Tool use that actually works
The single biggest difference between an agent that demos well and an agent that ships: how it uses tools.
Three rules we enforce on every project:
1. Function calling, not parsed text. Use the model's native function calling / tool calling API. Never parse natural language outputs to extract tool invocations — it will fail in production on edge cases the demo did not cover.
2. Strict tool schemas. Define every parameter, every type, every required field. Agents follow strict schemas reliably. Loose schemas produce malformed calls and crashes.
3. Idempotent tool design. Tools should be safe to call multiple times with the same input. Agents retry — by design and by accident. Tools that are not idempotent corrupt state.
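Rules 1 and 2 in practice: a strict tool schema in the JSON-Schema shape that function-calling APIs accept, with every parameter typed, required fields explicit, and an enum where a free string would invite malformed calls. The tool name and fields below are illustrative, and the validator is a deliberately tiny required/extra-key check (use a real JSON-Schema library such as `jsonschema` in production):

```python
lookup_order_tool = {
    "name": "lookup_order",
    "description": "Fetch an order by ID from the order service.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "pattern": "^ORD-[0-9]{6}$"},
            "fields": {
                "type": "array",
                "items": {"type": "string",
                          "enum": ["status", "items", "total"]},
            },
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
}

def validate_call(schema: dict, args: dict) -> bool:
    """Tiny check: all required keys present, no unknown keys."""
    params = schema["parameters"]
    if any(k not in args for k in params["required"]):
        return False
    return all(k in params["properties"] for k in args)

ok = validate_call(lookup_order_tool, {"order_id": "ORD-123456"})
bad = validate_call(lookup_order_tool, {"id": "123"})
```

Note that a read-only tool like this lookup is naturally idempotent (rule 3); write tools need an explicit idempotency key so a retried call does not apply twice.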
A common trap: giving an agent too many tools. We have seen agents handed 40+ tools and asked to "figure out which to use." The model gets confused and picks wrong. Cap each agent at 5–8 tools maximum. If you need more, use the supervisor + specialist pattern so each specialist sees only its own tools.
Memory architecture
Three layers, each used differently:
Short-term memory — the current conversation / task context. Held in the prompt window. Cleared between sessions.
Long-term memory — facts about the user, preferences, persistent state. Stored in a vector DB or relational store. Retrieved at session start.
Episodic memory — what the agent did last time it faced a similar problem. Used for self-improvement and few-shot examples. Stored alongside outcomes (success/failure) so the agent can learn from past episodes.
Most teams skip episodic memory and pay for it in plateaued quality. Adding even a simple "what worked last time?" lookup before each major decision lifts task success rates by 8–15%.
Guardrails — the part that matters most in production
An agent that calls tools is an agent that takes actions. Actions affect real systems, real customers, real money. Guardrails are non-negotiable.
What we put on every agent that touches production:
- Tool whitelisting per agent role — explicit, audited, reviewed
- Action budget — max number of tool calls per task; hard stop on exceed
- Spend caps — per-task token + API cost limit; circuit breaker on overspend
- Output validation — schema-checked outputs; malformed = automatic retry, not silent corruption
- Human-in-the-loop checkpoints — for any irreversible action (refund issued, email sent to customer, record deleted)
- Hallucination detection — automatic flag if the agent claims facts that are not in retrieved context
- Audit log — every prompt, every tool call, every output, every approval
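Two of those guardrails (the action budget and the spend cap) plus the audit log can live in one wrapper that every tool call goes through. This is a sketch with invented cost numbers, not a drop-in component:

```python
class BudgetExceeded(Exception):
    pass

class GuardedToolRunner:
    def __init__(self, max_calls: int, max_spend: float):
        self.max_calls, self.max_spend = max_calls, max_spend
        self.calls, self.spend = 0, 0.0
        self.audit_log: list[dict] = []

    def call(self, tool, cost: float, **kwargs):
        # Hard stops BEFORE the action, never after.
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded("action budget exhausted")
        if self.spend + cost > self.max_spend:
            raise BudgetExceeded("spend cap hit")
        self.calls += 1
        self.spend += cost
        result = tool(**kwargs)
        self.audit_log.append({"tool": getattr(tool, "__name__", "tool"),
                               "args": kwargs, "result": result})
        return result

runner = GuardedToolRunner(max_calls=3, max_spend=0.10)
runner.call(lambda order_id: "found", cost=0.02, order_id="ORD-1")
```

Routing every call through one chokepoint is what makes the audit log complete by construction rather than by discipline.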
The audit log is the single most under-built component on most agentic projects we audit. Skip it and you cannot answer "why did the agent do that?" when something goes wrong. Ship it from day one.
Our first agent went live without a proper audit trail. The first time it made a wrong call on a claim, we spent two weeks reconstructing what happened from logs. We never deploy an agent without audit logging now.
Observability — tracing every decision
Treating agent traces like distributed-system traces is the change that makes production agents actually maintainable.
Tools we use: LangSmith, Langfuse, Helicone, Arize Phoenix. Pick one and instrument from day one.
What you must capture per task:
- Full prompt + context window for every model call
- Every tool call with inputs and outputs
- Latency at each step
- Token usage at each step
- Outcome: success / partial / failure / human-overridden
- User feedback (thumbs up/down minimum)
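One possible shape for the per-step record, matching the checklist above. In practice you would emit these to whichever tracing backend you picked; this only shows the fields and a roll-up:

```python
from dataclasses import dataclass

@dataclass
class StepTrace:
    step: str            # "model_call" or "tool_call"
    prompt_or_input: str
    output: str
    latency_ms: float
    tokens: int
    outcome: str         # success / partial / failure / human-overridden

def summarise(traces: list[StepTrace]) -> dict:
    return {
        "steps": len(traces),
        "total_tokens": sum(t.tokens for t in traces),
        "total_latency_ms": sum(t.latency_ms for t in traces),
        "failed": [t.step for t in traces if t.outcome == "failure"],
    }

traces = [
    StepTrace("model_call", "plan the task", "3 sub-tasks", 420.0, 850, "success"),
    StepTrace("tool_call", "lookup_order ORD-1", "not found", 95.0, 0, "failure"),
]
report = summarise(traces)
```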
Without this, you have no way to diagnose why an agent's quality dropped this week, or to A/B test prompt changes, or to spot a regression after a model upgrade.
Real use cases we have shipped
BFSI: claims triage agent
A supervisor routes incoming insurance claims to specialist agents (auto, health, property). Each specialist reads the claim, retrieves policy + history, evaluates against rules, and proposes an outcome. A human approver sees the proposal + reasoning + sources. Result: 78% of straightforward claims resolved without human intervention; average handle time down 64%.
Customer support: omnichannel resolution
A single agent handles email + chat + voice transcript inputs, decides whether to resolve directly or escalate, drafts the response, and updates the CRM ticket. Agent has tools for KB lookup, order history, and refund issuance (capped at $50 without approval). Result: 41% of tickets resolved automatically; CSAT up 6 points.
Internal IT: dev environment troubleshooter
A planner agent decomposes "my dev environment is broken" into diagnostic sub-tasks. Worker agents check git state, dependencies, environment variables, and recent commits. The planner correlates findings and suggests a fix. Result: mean time to resolution down 51% on common issues.
Supply chain: vendor risk monitoring
A swarm of research agents continuously monitors news, financial filings, and shipping data for risk signals across 1,200+ vendors. A coordinator aggregates and ranks risks weekly. Result: earlier detection of 3 supplier failures that would otherwise have surfaced as missed deliveries.
When NOT to use agents
A short, opinionated list:
- Simple lookup tasks. A function call beats an agent every time.
- Latency-critical user interactions under 500ms. Agent overhead is too high.
- Tasks with strict deterministic requirements. A workflow engine is the right tool.
- Workloads with extreme cost sensitivity. Agents are expensive per task.
- High-volume, low-value transactions. The fixed agent overhead does not amortise.
If a project hits any of these, push back. The most professional answer is sometimes "you do not need an agent for this."
Final word
Multi-agent systems are now genuinely useful in enterprise production — but only when applied to problems that actually need them. The most successful teams we work with are not the most aggressive adopters. They are the ones who set a high bar, deploy agents only where simpler tools fall short, and treat guardrails and observability as non-negotiable from day one.
If you would like our team to evaluate a candidate use case, audit an existing agentic system, or design a new one from scratch, we offer a free 60-minute architecture review.

Written by
Pratik Kantesiya
AI Engineering Lead
Pratik leads AI engineering at Agile Infoways, where he architects production AI systems for enterprises across healthcare, BFSI, and logistics. He writes about practical AI delivery — what works, what does not, and what most teams miss between proof-of-concept and production.