re:Invent 2025 - Implementing Human-in-the-Loop controls for multi-agent AI systems
Multi-agent AI systems are reaching production faster than teams can build the trust to run them fully autonomously. This session provides a practical framework for where human oversight fits in agentic architectures, which AWS services support it natively, and how to design toward progressively less intervention over time.
At re:Invent 2025, Dhiraj Mahapatro, Principal Specialist Solutions Architect for Amazon Bedrock and AgentCore at AWS, walked through this progression with concrete examples and real tooling. In this post, we'll explore the decision points that call for human involvement, the specific implementation patterns available today, and how to move from high oversight toward justified autonomy as your systems mature.
Why multi-agent systems need deliberate human oversight
Most teams arrive at multi-agent architectures through incremental steps. You start with LLM inferencing, add chat history for context, layer in retrieval-augmented generation (RAG) with a vector store, and eventually find that static context isn't enough. Real-time data requires tools. Tools require agents. And once you have agents, you typically need multiple agents coordinating across workflows.
That final step changes the complexity profile significantly. A single agent talking to AWS Lambda functions is manageable. Add memory, Model Context Protocol (MCP) server integrations, agent-to-agent (A2A) communication, identity management, and Amazon CloudWatch observability, and you have a system where no single person can reason about every possible execution path. This isn't a design flaw; it's the nature of distributed systems. What it means in practice is that you need deliberate checkpoints where a human can verify the outcome before the system proceeds.
Mahapatro was direct about accountability: autonomous agents don't absorb responsibility when things go wrong in production. The developer or application owner does. HITL controls aren't about limiting what agents can do; they're about making autonomous systems something you can genuinely own.
The four situations that most consistently warrant a human checkpoint are high-stakes decisions, irreversible actions, regulatory requirements, and the trust-building phase of early deployment. For high-stakes decisions, the example Mahapatro used is a doctor using AI to assist with prescriptions. The agent might produce a reasonable suggestion, but the doctor needs to approve it, because the consequences of a wrong answer extend beyond what the system can account for. For irreversible actions, the logic is simpler: a financial transaction that can't be undone, a production resource that can't be restored. The cost of a human checkpoint is trivial compared to the cost of a wrong automated action. Some industries also require documented human oversight by law or policy, which means HITL is a compliance requirement regardless of your confidence in the underlying model. And for teams just starting with agents, Mahapatro recommended beginning with maximum human involvement and reducing it only as the system earns that trust through observed, consistent behavior in production.
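One way to make these four criteria operational is to encode them as an explicit policy check that runs before an agent acts. The sketch below is illustrative only: the `ProposedAction` fields, the `TRUST_BUILDING_DAYS` value, and the function names are assumptions made for this example, not something shown in the session.

```python
from dataclasses import dataclass

# Hypothetical length of the early-deployment window; tune it to your own rollout.
TRUST_BUILDING_DAYS = 30


@dataclass(frozen=True)
class ProposedAction:
    """An action an agent wants to take, tagged with the risk attributes above."""
    name: str
    high_stakes: bool = False
    irreversible: bool = False
    regulated: bool = False
    days_in_production: int = 0


def review_reasons(action: ProposedAction) -> list[str]:
    """Return every reason this action should stop at a human checkpoint."""
    reasons = []
    if action.high_stakes:
        reasons.append("high-stakes decision")
    if action.irreversible:
        reasons.append("irreversible action")
    if action.regulated:
        reasons.append("regulatory requirement")
    if action.days_in_production < TRUST_BUILDING_DAYS:
        reasons.append("trust-building phase")
    return reasons


def requires_human(action: ProposedAction) -> bool:
    """True when at least one of the four criteria applies."""
    return bool(review_reasons(action))
```

Returning the list of reasons rather than a bare boolean gives the reviewer, and the audit trail, a record of why the checkpoint fired.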
Implementation patterns for HITL controls
Several tools let you implement these checkpoints without building custom polling or callback infrastructure from scratch.
MCP includes a built-in capability called elicitation, which lets an MCP server pause mid-execution and request additional input from the user, through the client, before continuing. In Mahapatro's example, a request for flight status triggers an elicitation asking for the flight number. The same pattern applies anywhere you need a human to confirm or supply something before the server proceeds. This is useful because it keeps the pause logic inside the protocol rather than in application code.
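To make the flight-number example concrete, here is a rough sketch of the messages involved, written as plain Python dictionaries rather than with any particular MCP SDK. The `elicitation/create` method, the `requestedSchema` field, and the `accept`, `decline`, and `cancel` actions reflect our reading of the MCP specification's elicitation section, so check them against the spec version you target. The function names, the prompt text, and the `flight_number` field are assumptions for this example.

```python
from typing import Any, Optional


def build_flight_number_elicitation(request_id: int) -> dict[str, Any]:
    """Build the JSON-RPC request a server would send to ask for a flight number."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "elicitation/create",
        "params": {
            "message": "Which flight number should I check?",
            # A flat JSON Schema describing the input the server needs.
            "requestedSchema": {
                "type": "object",
                "properties": {
                    "flight_number": {
                        "type": "string",
                        "description": "Airline code and number",
                    }
                },
                "required": ["flight_number"],
            },
        },
    }


def extract_flight_number(response: dict[str, Any]) -> Optional[str]:
    """Return the supplied flight number, or None if the user declined or cancelled."""
    result = response.get("result", {})
    if result.get("action") != "accept":
        return None
    return (result.get("content") or {}).get("flight_number")
```

A server would continue with the flight-status lookup only when `extract_flight_number` returns a value, and treat `None` as a signal to stop or ask again.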
AWS Step Functions has a wait-for-callback pattern that handles HITL at the workflow level. When execution reaches a task requiring approval, the workflow sends a task token to an external target, such as an email, a Slack message, or a custom interface, and then pauses. The workflow remains paused until the token comes back with a success or failure signal, at which point execution resumes and routes based on the outcome. Mahapatro also mentioned AWS Lambda durable functions, announced during re:Invent 2025, which provide similar wait-and-resume capability for Lambda-based orchestration.
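A minimal sketch of both halves of the Step Functions pattern follows. The `.waitForTaskToken` resource suffix, the `$$.Task.Token` context path, and the `send_task_success` and `send_task_failure` calls are part of Step Functions and its SDK. The state names, the `notify-approver` function, the `ApprovalRejected` error name, and the one-day timeout are hypothetical choices for this example. The client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
import json
from typing import Any

# Task state that pauses until the token is returned. Names are hypothetical.
APPROVAL_STATE: dict[str, Any] = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "notify-approver",
        "Payload": {
            "taskToken.$": "$$.Task.Token",  # token the reviewer's response must carry
            "request.$": "$",                # the state input being reviewed
        },
    },
    "TimeoutSeconds": 86400,  # stop waiting after a day instead of pausing forever
    "Catch": [{"ErrorEquals": ["ApprovalRejected"], "Next": "HandleRejection"}],
    "Next": "ExecuteAction",
}


def resolve_approval(sfn_client: Any, task_token: str, approved: bool, reviewer: str) -> None:
    """Resume the paused execution with the reviewer's decision."""
    if approved:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "reviewer": reviewer}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",
            cause=f"Rejected by {reviewer}",
        )
```

In a real deployment, `sfn_client` would be `boto3.client("stepfunctions")`, and `resolve_approval` would run in whatever handler receives the reviewer's click from the email, Slack message, or custom interface.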
For teams building with LangGraph, the framework provides interrupt mechanisms at the graph execution level. A node can pause and surface the current execution state to a reviewer, who can approve the next step, reject it and redirect to an alternative branch, or modify the underlying state before continuing. That last capability is particularly useful: if the next action would be based on a value the human wants to change, they can update the state directly and the subsequent node receives the corrected input. Reviewers can also take control of tool selection, approving or substituting the tool the agent chose on a per-execution basis.
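The three reviewer outcomes described above (approve, reject and redirect, or edit the state and continue) can be sketched without any framework. The code below is a framework-agnostic illustration and does not use LangGraph's API. `ReviewDecision`, `apply_review`, and the node names are assumptions for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Literal


@dataclass
class ReviewDecision:
    """What the human reviewer chose to do at the pause point."""
    action: Literal["approve", "reject", "edit"]
    state_updates: dict[str, Any] = field(default_factory=dict)


def apply_review(
    state: dict[str, Any],
    decision: ReviewDecision,
    next_node: str,
    fallback_node: str,
) -> tuple[dict[str, Any], str]:
    """Return the state to continue with and the node to run next."""
    if decision.action == "approve":
        return dict(state), next_node
    if decision.action == "reject":
        # Leave the state alone and route to the alternative branch.
        return dict(state), fallback_node
    if decision.action == "edit":
        # Merge the reviewer's corrections so the next node sees the fixed input.
        return {**state, **decision.state_updates}, next_node
    raise ValueError(f"Unknown review action: {decision.action}")
```

For example, a reviewer who thinks a proposed refund is too large can return `ReviewDecision("edit", {"refund_amount": 50})`, and the refund node then runs against the corrected amount rather than the agent's original value.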
For A2A workflows, Mahapatro demonstrated a travel agent architecture where an A2A client in Lambda calls a weather agent behind Amazon API Gateway, which returns a push notification. The Lambda response path is a natural place to inject a Step Functions checkpoint before the final output reaches the user, keeping oversight at the boundary between agents rather than buried inside any one of them.
Building toward progressive autonomy
HITL is a starting point, not a permanent state. The design goal is to begin with high human involvement and remove it incrementally as evidence accumulates that the system behaves as intended.
Three signals indicate that a checkpoint is ready to be automated. If an agent consistently produces outputs above a confidence threshold you've defined, manual review of those outputs stops providing value. If your audit trails and operational metrics show the system performing as expected across a meaningful volume of production traffic, you have the data to justify reducing oversight for that class of action. And if you've built a feedback loop where human corrections flow back into the system, the rate of corrections should decline over time as the agent incorporates that signal.
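These three signals can be combined into a simple readiness check per checkpoint. The sketch below is one possible formulation, not a method from the session: the `WindowStats` fields, the function name, and the default thresholds (1,000 decisions, a 98% high-confidence rate, a 1% correction rate) are all assumptions to replace with values that fit your own risk tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Audit-trail counts for one checkpoint over one observation window."""
    decisions: int        # decisions that passed through the checkpoint
    high_confidence: int  # decisions at or above your confidence threshold
    corrections: int      # decisions a human reviewer had to change


def ready_to_automate(
    previous: WindowStats,
    current: WindowStats,
    min_volume: int = 1000,
    min_high_confidence_rate: float = 0.98,
    max_correction_rate: float = 0.01,
) -> bool:
    """True when volume, confidence, and the correction trend all support automation."""
    if current.decisions < min_volume or previous.decisions == 0:
        return False  # not enough production traffic to judge
    high_confidence_rate = current.high_confidence / current.decisions
    current_rate = current.corrections / current.decisions
    previous_rate = previous.corrections / previous.decisions
    return (
        high_confidence_rate >= min_high_confidence_rate
        and current_rate <= max_correction_rate
        and current_rate <= previous_rate  # corrections are not trending upward
    )
```

Evaluating each checkpoint separately matters here: one class of action can earn autonomy while another, with lower volume or a rising correction rate, stays under review.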
Amazon Bedrock AgentCore evaluations, in preview at the time of the talk, support this trajectory by providing structured measurement of agent behavior over time. As evaluation maturity increases, the proportion of decisions that need human review decreases. Mahapatro framed this through the concept of a "centaur" system from a Harvard research paper, a reference to the half-human, half-horse figure from mythology. The most effective outcomes come not from pure automation or pure human review, but from humans and agents dividing work according to where each is most reliable, and progressively rebalancing that division as trust is established.
The practical takeaway is to design HITL in from the start, identify the specific checkpoints where human judgment adds the most value, use the tooling available rather than building custom infrastructure, and track the metrics that tell you when each checkpoint has earned the right to be removed. Start conservative. Let the data make the case for autonomy.
Watch the full session recording: CNS428 - Implementing Human-in-the-Loop controls for multi-agent AI systems