
Minimizing startup latency with Amazon Bedrock AgentCore Runtime

16 minute read
Content level: Advanced

This article decomposes end-to-end latency in agentic AI applications running on Amazon Bedrock AgentCore Runtime, isolates startup (cold-start) latency as a distinct optimization target, and presents practical strategies for minimizing it, including container optimization, pre-warming via strategic pinging, multiple endpoints, and self-maintained warm pools.

This article was written by Massimiliano Angelino, Alexander Arzhanov, Reem Gaafar and Oliver Moeller.

Latency is a primary systems concern in agentic AI applications. In interactive settings, it shapes perceived responsiveness. In tool-augmented and multi-agent workflows, it affects coordination efficiency and tail behavior across the execution path.

We begin with a brief breakdown of end-to-end agent latency, then focus on startup latency in Amazon Bedrock AgentCore Runtime. This framing helps separate startup overhead from the rest of the invocation path and makes clear which portion of latency can be improved through runtime-level optimizations. We then discuss practical approaches for reducing startup latency in AgentCore Runtime.

1. Understanding latency in agentic AI applications

It is useful to place startup latency in the context of the full agent execution path. This makes it easier to separate runtime overhead from the other components of end-to-end latency.

Decomposing the agent execution path

In agentic applications, latency cannot be reduced to a single inference metric. A typical agent invocation traverses an execution path with multiple stages, each contributing separately to end-to-end latency.

At a high level, agent latency can be decomposed into runtime initialization, iterative model inference, sub-agent orchestration, tool execution, and response streaming. Each stage contributes to total latency, but the underlying mechanisms and optimization strategies differ substantially.

Runtime initialization captures startup work, including provisioning the AgentCore Runtime execution environment and loading agent artifacts. Iterative execution covers the core agent loop, where model invocations and tool or sub-agent calls are interleaved until the task is complete. Response streaming governs how quickly partial output is returned to the caller, which often shapes perceived responsiveness even when total execution time remains unchanged.

Latency as a budget: what to measure, and what to optimize

Once the execution path is decomposed, latency can be treated as a budget across runtime initialization, iterative execution, and response streaming rather than as a single aggregate number. This distinction is especially important in multi-agent systems, where delays in one stage propagate downstream and directly increase the latency of subsequent stages.

The starting point is visibility into where time is actually spent. In practice, that means answering a small set of questions:

  • How much time each agent spends in runtime initialization versus active execution
  • Where model latency sits in the overall budget, including time to first token (TTFT), token generation rate in tokens per second (TPS), and end-to-end response latency
  • Which tool calls dominate execution time and whether they block downstream work
  • Whether tail latency is driven primarily by cold starts, model behavior, or external dependencies

Equally important is clarity on the optimization objective. Some workloads are most sensitive to interactivity, often captured by TTFT. Others are governed by end-to-end completion time, where the relevant measure is total execution time. At production scale, many systems are constrained by tail predictability, typically expressed through p95 or p99 latency.

These objectives are not interchangeable. A workload can exhibit strong interactivity and still perform poorly on completion time or show low median latency while remaining unstable in the tail. Effective optimization therefore begins with decomposition, measurement, and an explicit choice of which latency objective matters most.
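As a minimal sketch of why the median and the tail have to be read separately, the snippet below summarizes a list of latency samples with percentiles. The function name `latency_summary` and the sample values are hypothetical and not taken from any benchmark; only the Python standard library is used.

```python
import statistics


def latency_summary(samples_ms):
    """Return p50, p95, and p99 for latency samples given in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99 of the sorted data.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical workload: mostly warm requests plus a few cold starts.
samples = [200.0] * 95 + [2900.0] * 5
summary = latency_summary(samples)
print(summary)
```

With data shaped like this, the median stays at the warm-path latency while p99 reflects the cold starts, which is the situation the paragraph above describes.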

2. AgentCore Runtime startup latency

Understanding startup latency in AgentCore Runtime requires examining how execution environments are managed across the session lifecycle. AgentCore Runtime runs agent code in isolated microVMs, and startup time depends on whether an execution environment is already available or must be provisioned on demand. This section covers the session lifecycle, deployment mode tradeoffs, startup behavior, and the measurements needed to isolate startup overhead.

Session lifecycle

AgentCore Runtime manages execution environments at the session level. Each session is associated with an isolated microVM that transitions through three states over its lifetime: Active, Idle, and Terminated.

In the Active state, the microVM is executing handler code or processing requests. After execution completes, the session enters the Idle state, where the microVM remains provisioned but inactive. The duration of this state is controlled by idleRuntimeSessionTimeout, which defaults to 15 minutes and can be configured between 60 and 28,800 seconds (8 hours). While the session remains idle, subsequent invocations for the same session can reuse the existing execution environment and therefore avoid startup overhead.

The session transitions to the Terminated state when the idle timer expires, when the session reaches its maximum lifetime, or when it is explicitly terminated through the API. The maximum session lifetime defaults to 8 hours and cannot exceed that value. From a latency standpoint, this state model determines whether a request is served by an already provisioned environment or requires a new one to be created. That distinction is the basis for the startup behavior discussed next.
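The lifecycle rules above can be approximated on the client side to predict whether a request is likely to hit a warm environment. The sketch below is only a model under stated assumptions: `SessionClock` and `is_reusable` are hypothetical names, the timestamps are tracked by the caller, and the service itself remains the authority on session state.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_S = 15 * 60       # default idleRuntimeSessionTimeout per the text
MAX_LIFETIME_S = 8 * 60 * 60   # maximum session lifetime per the text


@dataclass
class SessionClock:
    created_at: float       # seconds, when the session was first provisioned
    last_active_at: float   # seconds, when the last invocation finished


def is_reusable(session, now,
                idle_timeout_s=IDLE_TIMEOUT_S,
                max_lifetime_s=MAX_LIFETIME_S):
    """True if an invocation at `now` would likely reuse the warm environment."""
    idle_expired = now - session.last_active_at >= idle_timeout_s
    lifetime_expired = now - session.created_at >= max_lifetime_s
    return not (idle_expired or lifetime_expired)
```

A caller could use a check like this to decide whether to reuse a stored session ID or to request a fresh, pre-warmed one.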

Deployment modes and their latency implications

AgentCore Runtime supports both code and container deployment, and the two modes differ primarily in how much work must be completed before the handler can execute.

With code deployment, the agent and its dependencies are packaged as a ZIP archive and run in a managed Python environment. Because the runtime setup is comparatively lightweight, code deployments generally have a lower startup baseline than container deployments.

With container deployment, the agent is packaged as a container image in Amazon ECR. This offers greater control over the software environment, but it also results in a higher startup latency baseline than code deployment because the image must be retrieved, its layers materialized, and the runtime initialized before execution begins.

From a latency perspective, both modes follow the same execution model inside AgentCore Runtime. What changes is the amount of provisioning and initialization work on the startup path.

Cold-start anatomy: where the time goes

An invocation in AgentCore Runtime can follow one of two paths, depending on whether an execution environment is already available for the session.

A cold start occurs when no pre-initialized environment exists. The platform must first provision a microVM, load the deployment artifact, initialize the runtime environment, and execute startup logic before the handler can run. Together, these steps define startup overhead.

The relative contribution of these steps depends on where initialization work resides. Part of the startup path is platform-managed, including microVM provisioning and artifact download. Another part is application-defined, such as dependency loading, agent construction, tool registration, and establishing external connections. In practice, startup latency reflects the combined effect of both.

A warm start, by contrast, occurs when a request reuses a session ID whose execution environment is still available. AgentCore Runtime can then reuse that session environment and route the request directly to the handler, bypassing the startup path. This makes startup latency in AgentCore Runtime closely tied to session ID reuse.

Figure 1 - Cold-start anatomy

Cold-start mitigation: built-in optimization

To reduce startup latency for container deployments, AgentCore Runtime maintains pre-warmed instances as an internal optimization. These pre-initialized environments allow new sessions to begin execution without incurring the full startup path on every request.

The pre-warmed instances are instantiated when an agent runtime is created or updated, and are replenished as they are consumed. When a request arrives for a new session, AgentCore Runtime can assign one of these pre-warmed instances rather than provisioning one entirely from scratch. In practice, this removes a substantial portion of startup overhead from the critical path.

This mechanism is specific to container deployments. For code deployments, execution environments are provisioned on demand. As a result, startup-latency mitigation in AgentCore Runtime is partly deployment-mode specific: container deployments mitigate cold starts thanks to pre-warmed instances, whereas code deployments rely more directly on minimizing the startup path itself.

Isolating AgentCore startup latency

Before startup latency can be optimized, it must be measured in a way that separates AgentCore overhead from iterative execution latency. End-to-end response latency combines runtime initialization, iterative execution, and response streaming, making attribution difficult unless these components are measured independently.

A practical approach is to reduce variability in iterative execution during measurement, for example by replacing model calls, tool calls, or sub-agent invocations with fixed stub responses, or by measuring those components separately. Startup and steady-state paths should also be evaluated independently, since reused session environments bypass most of the startup path and reflect a fundamentally different latency profile.

In practice, five measurement principles matter most:

  • Isolate iterative execution latency from platform overhead, either by stubbing model, tool, or sub-agent calls, or by measuring them separately
  • Measure startup and steady-state paths independently, since they reflect different execution regimes
  • Use percentile metrics rather than averages, especially for production-facing and multi-agent workloads
  • Instrument at the handler boundary, so time spent before application logic begins can be distinguished from time spent inside the handler
  • Benchmark at the expected concurrency and request rate, so the measured startup profile reflects production conditions

In AgentCore Runtime, the last principle is particularly important for container deployments. Startup behavior depends not only on session reuse but also on whether incoming requests can be routed to pre-initialized environments. Because that capacity replenishes over time rather than instantaneously, measurements collected at unrealistically low traffic levels may understate startup exposure relative to production, while measurements at unrealistically high levels may overstate it.

These principles establish the baseline needed to evaluate startup optimizations meaningfully. Without that separation, it becomes difficult to determine whether an observed improvement comes from the runtime, the iterative execution path, or the application itself.
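One way to apply the handler-boundary principle is sketched below. It is an illustration under assumptions, not the AgentCore SDK's instrumentation: `timed_handler`, `METRICS`, and the record fields are hypothetical names, and the stubbed handler stands in for real agent logic so that iterative execution does not distort the measurement.

```python
import functools
import time

# Captured once at import time, when application-defined startup begins.
_MODULE_LOADED_AT = time.perf_counter()
_first_call_seen = False
METRICS = []  # in a real system this would go to a metrics backend


def timed_handler(func):
    """Record per-invocation timing at the handler boundary."""
    @functools.wraps(func)
    def wrapper(payload):
        global _first_call_seen
        entered_at = time.perf_counter()
        record = {"first_call_in_process": not _first_call_seen}
        if not _first_call_seen:
            # Application-defined startup: module import until first entry.
            record["startup_to_handler_s"] = entered_at - _MODULE_LOADED_AT
        _first_call_seen = True
        try:
            return func(payload)
        finally:
            record["handler_s"] = time.perf_counter() - entered_at
            METRICS.append(record)
    return wrapper


@timed_handler
def handler(payload):
    # Stubbed response: no model, tool, or sub-agent call is made here.
    return {"echo": payload}
```

This only captures time that is visible inside the process. The platform-managed portion of startup can then be estimated by subtracting these values from the latency observed by the client.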

3. Mitigation strategies

This section discusses several approaches for reducing the impact of cold-start latency on your applications.

Optimizing container size and code initialization

Recall that cold-start latency has two components: a platform-managed portion (microVM provisioning, artifact download, layer materialization) and an application-defined portion (dependency loading, agent construction, tool registration, connection setup). This gives you two levers: artifact size (smaller packages download and materialize faster) and initialization logic (less pre-handler work means new sessions serve traffic sooner).

Choosing code vs. container deployment

Both code and container deployment modes behave identically once the handler is running; they differ in how much the platform does before that point.

  • Code deployment packages the agent as a ZIP file and runs it in a managed Python runtime. On each new session, the runtime downloads the code from S3 onto a pre-warmed microVM running the base image. It offers a lower startup baseline and a higher new-session creation rate than container deployment, but is limited to Python 3.10+, ≤250 MB, and the AgentCore-managed OS and runtime. You can review the limits here.

  • Container deployment uses an ARM64 image from Amazon ECR, supporting up to 2 GB and full control over language, base image, and system dependencies. The platform must pull the image, materialize layers, and initialize the container first — so the baseline is higher.

Container image optimization

Smaller images pull and initialize faster. Start from a slim ARM64 base (e.g. ghcr.io/astral-sh/uv:python3.11-bookworm-slim), install from a lockfile with uv sync --frozen --no-cache for deterministic, cache-free installs, and drop any dependency the agent doesn't import at runtime.

Code initialization optimization

Code at module scope runs when the container starts inside the microVM, before the first request arrives. Lightweight setups such as boto3 clients belong here. Heavier work — retrieving data, fetching prompt templates — should be kicked off asynchronously at startup, keeping the container responsive to health checks while the heavy lifting runs in parallel. Lazy loading for code paths that are rarely used is another option to optimize code initialization times.

Even with these optimizations, cold starts remain unavoidable for new sessions. The next sections cover strategies to hide or eliminate that remaining latency.

Pre-warming instances through strategic pinging

You can "hide latency behind UX" by triggering agent initialization early in the user journey. For example, generate a session ID and invoke your agent's endpoint with a ping message when a user opens the application, before they start interacting with the agent. This common serverless pattern pre-warms the instance tied to the session ID so that it is ready when the user makes their first request. It is simple to implement and, in many cases, fully hides cold-start latency. It is also cost-effective and works well for applications where users interact with the agent. You can find an example implementation of this pattern here.

Figure 2 – Pre-warming a session via early ping
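A minimal sketch of the early-ping pattern is shown below. The `prewarm_session` helper is hypothetical, and it assumes the agent handler treats a `{"type": "warmup"}` payload as a no-op. The `invoke` argument is any callable that accepts the same keyword arguments as the boto3 `bedrock-agentcore` client's `invoke_agent_runtime`, which keeps the sketch testable without AWS credentials.

```python
import json
import uuid


def prewarm_session(invoke, agent_runtime_arn):
    """Send a warm-up ping for a fresh session and return its session ID."""
    session_id = str(uuid.uuid4())
    invoke(
        agentRuntimeArn=agent_runtime_arn,
        runtimeSessionId=session_id,
        # Sentinel payload; assumes the handler returns early when it sees it.
        payload=json.dumps({"type": "warmup"}).encode(),
    )
    return session_id
```

In an application, this would typically be called when the user opens the page, for example with `boto3.client("bedrock-agentcore").invoke_agent_runtime` as `invoke`, and the returned session ID would be stored and reused for the user's first real request.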

Scaling pre-warmed instances via multiple endpoints

The next section covers building your own warm-pool infrastructure. Before taking that step, consider a simpler option available for container deployments: create multiple AgentCore Runtime endpoints for the same agent. Because the pre-warmed instances are maintained per endpoint, multiple endpoints effectively multiply the number of pre-initialized environments the service keeps ready. You can then distribute traffic across endpoints in your invoke layer, using a simple round-robin strategy. This gives you additional headroom for cold-start mitigation without operating a session queue yourself.
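The round-robin distribution described above could look like the following sketch. `EndpointRotation` and the endpoint names are hypothetical. How the chosen name is passed to the service is an assumption as well: the sketch presumes it is supplied as the endpoint qualifier when invoking the runtime, which should be checked against the API reference.

```python
import itertools
import threading


class EndpointRotation:
    """Thread-safe round-robin over a fixed list of endpoint names."""

    def __init__(self, endpoint_names):
        if not endpoint_names:
            raise ValueError("at least one endpoint name is required")
        self._cycle = itertools.cycle(list(endpoint_names))
        self._lock = threading.Lock()

    def next_endpoint(self):
        # The lock keeps the rotation consistent across concurrent callers.
        with self._lock:
            return next(self._cycle)


# Hypothetical endpoint names created for the same agent runtime.
rotation = EndpointRotation(["endpoint-a", "endpoint-b", "endpoint-c"])
```

This sketch assumes the endpoint is chosen once per new session and that the same choice is reused for that session's follow-up requests, so that session reuse is not affected by the rotation.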

Maintaining a warm pool

For high-traffic applications and applications that do not necessarily rely on user interaction, you can maintain a pool of pre-warmed sessions. The architecture in Figure 3 shows the four components of a minimal reference implementation: an SQS FIFO queue holding warmed session IDs, a getSessionId Lambda that hands sessions to clients and triggers replenishment, a replenish Lambda that warms a new session and returns it to the pool, and an agent handler that short-circuits warm-up pings. The client calls getSessionId, then invokes AgentCore Runtime directly with the returned session ID.

Figure 3 - Reference Architecture for a self-maintained warm pool. The blue path is executed asynchronously and creates new session_ids for the self-maintained warm pool

1. getSessionId Lambda

Pulls one warmed session ID from the SQS FIFO pool, asynchronously triggers replenishing to keep the pool topped up, and returns the session ID to the caller. Note: If your client is a trusted service, you can collapse getSessionId into the client itself — receive from SQS and async-invoke replenish directly. This removes one Lambda hop.

import os
import uuid
import boto3

sqs = boto3.client("sqs")
lam = boto3.client("lambda")

POOL_QUEUE_URL = os.environ["POOL_QUEUE_URL"]
REPLENISH_FUNCTION = os.environ["REPLENISH_FUNCTION"]

def handler(event, context):
    # Pull one warmed session ID from the FIFO pool.
    resp = sqs.receive_message(
        QueueUrl=POOL_QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=1,
    )
    # "Messages" is absent from the response when the queue is empty.
    messages = resp.get("Messages", [])
    if messages:
        msg = messages[0]
        session_id = msg["Body"]
        sqs.delete_message(QueueUrl=POOL_QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
    else:
        # Pool is empty (for example under burst traffic): fall back to a
        # fresh session ID, which means a cold start for this request.
        session_id = str(uuid.uuid4())

    # Fire-and-forget: replenish the pool without blocking the client.
    lam.invoke(
        FunctionName=REPLENISH_FUNCTION,
        InvocationType="Event",
        Payload=b"{}",
    )

    return {"sessionId": session_id}

2. Replenish Lambda

Pings AgentCore Runtime with the "warmup" sentinel and a fresh session ID to provision a new microVM, then publishes that session ID to the SQS FIFO pool.

import json
import os
import uuid
import boto3

acr = boto3.client("bedrock-agentcore")
sqs = boto3.client("sqs")

AGENT_RUNTIME_ARN = os.environ["AGENT_RUNTIME_ARN"]
POOL_QUEUE_URL = os.environ["POOL_QUEUE_URL"]

def handler(event, context):
    session_id = str(uuid.uuid4())

    # Ping AgentCore Runtime with the sentinel payload.
    # This provisions a microVM and pins it to session_id.
    acr.invoke_agent_runtime(
        agentRuntimeArn=AGENT_RUNTIME_ARN,
        runtimeSessionId=session_id,
        payload=json.dumps({"type": "warmup"}).encode(),
    )

    # Publish the warmed session ID to the FIFO pool.
    sqs.send_message(
        QueueUrl=POOL_QUEUE_URL,
        MessageBody=session_id,
        MessageGroupId="pool",
        # Session IDs are unique, so each one doubles as the deduplication ID.
        MessageDeduplicationId=session_id,
    )

3. Agent handler

Short-circuits warm-up pings with a sentinel check before any LLM call, tool execution, or billed active compute runs.

from bedrock_agentcore import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def handler(payload):
    if payload.get("type") == "warmup":
        return {"status": "warm"}  # skip LLM, tools, billed compute

    # normal agent logic here
    ...


app.run()

4. Invoke wrapper

Synchronously calls getSessionId to obtain a warmed session ID, then invokes AgentCore Runtime directly with that session ID so the request lands on a warm microVM.

import json
import os
import boto3

acr = boto3.client("bedrock-agentcore")
lam = boto3.client("lambda")

AGENT_RUNTIME_ARN = os.environ["AGENT_RUNTIME_ARN"]
GET_SESSION_FUNCTION = os.environ["GET_SESSION_FUNCTION"]

def invoke_with_warm_session(prompt: str) -> dict:
    resp = lam.invoke(
        FunctionName=GET_SESSION_FUNCTION,
        InvocationType="RequestResponse",
        Payload=b"{}",
    )
    session_id = json.loads(resp["Payload"].read())["sessionId"]

    return acr.invoke_agent_runtime(
        agentRuntimeArn=AGENT_RUNTIME_ARN,
        runtimeSessionId=session_id,
        payload=json.dumps({"prompt": prompt}).encode(),
    )

You control the warm pool size to match your expected demand. For example, at 10 requests per second and a 3-second replenishment time, about 10 × 3 = 30 sessions are being replenished at any moment, so a pool of 40 instances leaves headroom and avoids cold starts at steady state. Traffic spikes that exceed this headroom can still cause cold starts. The replenishment rate depends on the "New sessions create rate", which you can find in the throttling limits section of the documentation.
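The sizing example above follows from a simple steady-state estimate: the number of sessions being replenished at any moment is roughly the request rate multiplied by the replenishment time, plus whatever headroom you choose. The helper below is a hypothetical sketch of that arithmetic, not a sizing rule from the documentation.

```python
import math


def warm_pool_size(requests_per_s, replenish_s, headroom_sessions=0):
    """Estimate the pool size needed so steady demand does not drain it."""
    # Sessions consumed while one replacement session is still warming up.
    in_replenishment = math.ceil(requests_per_s * replenish_s)
    return in_replenishment + headroom_sessions


baseline = warm_pool_size(10, 3)
with_headroom = warm_pool_size(10, 3, headroom_sessions=10)
print(baseline, with_headroom)
```

With 10 requests per second and a 3-second replenishment time, the baseline is 30 sessions, and 10 sessions of headroom gives the pool of 40 used in the example.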

The snippets above illustrate the core mechanics only. A production implementation typically adds a heartbeat loop to keep pooled sessions alive past the idle timeout, retry logic on failed pings, graceful degradation when the pool is empty under burst traffic, and metrics to right-size the pool.

Figure 4 – Latency comparison: on-demand vs. pre-warmed sessions

Benchmark results: In our benchmark in April 2026, requests assigned to pre-warmed sessions were roughly 90% faster than requests that triggered new session creation on demand. Average latency fell from about 2.9 s to about 250 ms, and median latency (p50) dropped from about 2.9 s to about 200 ms. Tail latency (p99) improved substantially as well, falling from above 4.9 s to below 500 ms. When the session pool was sized to match the workload, all requests were served from pre-warmed sessions.

Pricing Considerations

The warm-pool architecture takes advantage of Amazon Bedrock AgentCore's consumption-based pricing model, which charges separately for memory and active compute time. Since you only pay for memory when the microVM is idle, maintaining pre-warmed sessions is cost-effective and creates a predictable baseline cost in exchange for avoiding cold starts. For high-traffic applications where latency directly impacts user experience, this modest cost overhead is typically far cheaper than the business cost of cold-start delays. Monitor your actual traffic patterns to right-size the pool. You may find that fewer instances suffice during off-peak hours, allowing you to scale the warm pool dynamically and optimize costs further.

4. Conclusions

Startup latency in AgentCore Runtime splits into a platform-managed portion (microVM provisioning, artifact download) and an application-defined portion (imports, client setup, initialization). Developers control the second portion by shrinking the deployment artifact and optimizing the module-level startup code so that new sessions reach the entrypoint handler faster. When new-session cold starts still matter, the mitigation ladder is: trigger pre-warming early in the user journey, spread traffic across multiple endpoints to expand the managed warm pool, or maintain your own FIFO pool of pre-warmed sessions.

To learn more, see the AgentCore Runtime documentation. For questions about your specific workload, reach out to your AWS account team.