Agentic Observability for Enterprise Workloads: Leveraging AWS DevOps Agent for Multi-APM Integration
Enterprise SRE teams in multi-APM environments waste critical incident response time manually correlating conflicting signals across Datadog, New Relic, and Splunk — directly increasing MTTR and business impact. This article demonstrates how AWS DevOps Agent autonomously correlates multi-APM telemetry via MCP integrations, using a payment processing incident scenario to walk through Agent Space setup, IAM configuration, and intelligent root cause identification.
Overview
Enterprise organizations operate in multi-APM tool environments for several reasons: each APM tool has specialized strengths for different observability layers (infrastructure metrics, application traces, log analytics); organizations inherit different tooling through mergers, acquisitions, and team-level decisions; and maintaining multiple vendors helps avoid lock-in to any single provider's ecosystem. For example, a team may use Datadog for infrastructure monitoring, New Relic for application performance, and Splunk for security and log analytics.
The trade-off is operational complexity. When incidents occur, the Site Reliability Engineering (SRE) team must manually correlate telemetry across these disparate tools: infrastructure metrics in one dashboard may show healthy systems while application traces in another reveal timeout errors. This manual correlation is time-intensive and error-prone, directly increasing Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR). For business-critical systems, every minute of delayed root cause identification translates to failed transactions, revenue loss, and eroded customer trust.
This article demonstrates how AWS DevOps Agent's telemetry integration capabilities (Refer to "Connecting telemetry sources" in the AWS DevOps Agent User Guide) enable intelligent incident response, eliminating manual correlation overhead while maintaining enterprise-grade observability. The approach applies to workloads across industries, including but not limited to banking, payments, and capital markets, where multi-APM environments and strict compliance requirements demand rapid and accurate incident resolution.
The Challenge: SRE Teams and Multi-APM Complexity
The traditional approach relies on manual log collection and limited APM integration, requiring engineers to correlate data through complex queries across disparate systems. This fragmented process significantly increases MTTR, often causing business impact during critical incidents. Enterprise teams need autonomous systems that correlate multi-source telemetry, resolve conflicting signals, and provide actionable insights—capabilities that define Agentic AIOps. Without intelligent automation, SRE teams spend valuable time on correlation rather than resolution, delaying recovery, and thus impacting customer experience.
The Incident
The following scenario illustrates a representative incident pattern. Actual resolution times vary based on environment complexity, tool configuration, and incident severity.
Consider a global financial services organization running a real-time payment processing platform on AWS. Their SRE team monitors the platform using Datadog for infrastructure, New Relic for application performance, and Splunk for security and log analytics. At 2:47 AM, their payment gateway experiences intermittent transaction failures, triggering a PagerDuty alert.
The on-call SRE engineer begins manual triage:
- Opens Datadog: infrastructure metrics appear normal (CPU, memory, network all within thresholds)
- Switches to New Relic: application traces indicate timeout errors on the payment service
- Checks Splunk: no security anomalies detected
The engineer now faces the core multi-APM challenge: infrastructure says healthy, application says failing. They must manually correlate signals across three separate dashboards, build mental models of transaction flow, and decide which tool's data to investigate in depth. After extended investigation across these tools, they identify database connection pool exhaustion as the root cause. During this manual correlation window, failed transactions accumulate, requiring manual reconciliation.
This incident is not isolated. Over the past quarter, the team has seen MTTR steadily increase as their application footprint has grown across more services and more monitoring tools. Each incident requires the same manual correlation dance: switching between dashboards, building ad-hoc queries, and reconciling conflicting signals. The team proposed building a custom correlation layer to aggregate telemetry from their APM tools into a unified view, but the effort was not approved: the estimated development timeline, ongoing maintenance burden, and the need to keep pace with each APM vendor's API changes made it impractical. They need a managed solution that integrates with their existing tools without requiring custom development.
Discovering AWS DevOps Agent
At AWS re:Invent 2025, the SRE Lead sees the AWS DevOps Agent announcement: a frontier agent designed to resolve and proactively prevent incidents while continuously improving system reliability. Among its broad set of capabilities, which include proactive recommendations, CI/CD pipeline integration (Refer to "Connecting to CI/CD pipelines" in the AWS DevOps Agent User Guide), and code repository analysis, the built-in multi-APM telemetry integration stands out as directly addressing their correlation challenge. The service connects natively to the APM tools the team already uses, operates autonomously around the clock, and requires no custom integration development. They decide to evaluate it.
Setting Up: From Evaluation to Production
Configuring the Agent Space
The team begins by creating an Agent Space (Refer to "What are DevOps Agent Spaces?" in the AWS DevOps Agent User Guide), a logical container that defines what AWS DevOps Agent can access and investigate. Since the organization operates multiple lines of business (LOBs), each with its own AWS accounts (Refer to "Connecting multiple AWS Accounts" in the AWS DevOps Agent User Guide), they configure separate Agent Spaces for each: retail banking, capital markets, and payments. Each LOB's SRE team owns and manages its respective Agent Space. This design leverages the service's three-level isolation model:
- AWS account isolation: Each Agent Space uses dedicated IAM roles (Refer to "DevOps Agent IAM permissions" under the AWS DevOps Agent Security section of the AWS DevOps Agent User Guide) granting access only to explicitly configured AWS accounts and resources. The agent cannot access resources outside its configured scope.
- User access isolation: Each LOB's SRE team controls which users and groups can access their Agent Space, aligning with organizational boundaries. Authentication is managed through AWS IAM Identity Center (Refer to "Setting Up IAM Identity Center Authentication" under the AWS DevOps Agent Security section of the AWS DevOps Agent User Guide) or IAM.
- Data isolation: Investigation data, incident history, and recommendations are maintained separately within each Agent Space. Information from one space is not visible or accessible from another.
This ensures the payments team's investigation data never crosses into the capital markets team's space, meeting the organization's regulatory requirements for data segregation. (Refer to the "Best Practices for Deploying AWS DevOps Agent in Production" blog post.)
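To make the three isolation levels concrete, the following is a minimal sketch that models them as plain data checks. The `AgentSpace` class, its fields, and the account and group values are hypothetical illustrations, not part of any AWS API; the real boundaries are enforced by the service through IAM roles and Agent Space configuration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSpace:
    """Hypothetical model of one Agent Space's boundaries (not an AWS API)."""
    name: str
    account_ids: frozenset[str]      # AWS account isolation
    allowed_groups: frozenset[str]   # user access isolation
    # Data isolation: findings are stored only on the space that produced them.
    findings: list[str] = field(default_factory=list)

    def can_investigate(self, account_id: str) -> bool:
        """The agent may only touch accounts configured for this space."""
        return account_id in self.account_ids

    def can_view(self, user_groups: set[str]) -> bool:
        """A user needs membership in at least one allowed group."""
        return bool(self.allowed_groups & user_groups)


# Placeholder account IDs and group names, one space per line of business.
payments = AgentSpace("payments", frozenset({"111111111111"}), frozenset({"payments-sre"}))
capital_markets = AgentSpace("capital-markets", frozenset({"222222222222"}), frozenset({"cm-sre"}))

payments.findings.append("connection pool exhaustion on payments-db")
```

In this sketch a payments SRE can view the payments space but not the capital markets space, and a finding recorded in one space never appears in the other.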
Granting AWS Resource Access
For each Agent Space, the LOB's SRE team creates IAM roles with read-only permissions (Refer to "Limiting Agent Access in an AWS Account" under the AWS DevOps Agent Security section of the AWS DevOps Agent User Guide) scoped to the services and resources relevant to their line of business. Following least-privilege principles, they use specific IAM actions rather than wildcards, and scope resources to their specific accounts and environments:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:DescribeAlarms",
        "logs:GetLogEvents",
        "logs:FilterLogEvents"
      ],
      "Resource": [
        "arn:aws:cloudwatch:<region>:<account-id>:alarm:*",
        "arn:aws:logs:<region>:<account-id>:log-group:/aws/eks/payments-*:*",
        "arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/payments-*:*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceStatus"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/LOB": "payments",
          "aws:ResourceTag/Environment": "production"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "lambda:GetFunction",
        "lambda:GetFunctionConfiguration"
      ],
      "Resource": "arn:aws:lambda:<region>:<account-id>:function:payments-*"
    }
  ]
}
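Replace <region> and <account-id> with your specific values. Adjust resource ARN patterns and tag conditions to match your naming conventions and organizational structure. Before deploying, check each action against the Service Authorization Reference: actions that do not support resource-level permissions (for example, cloudwatch:GetMetricData and ec2:DescribeInstances) may need a separate statement with "Resource": "*".
They also restrict regional access to the regions where their workloads run. (Refer to "Limiting Agent Access in an AWS Account" under the AWS DevOps Agent Security section of the AWS DevOps Agent User Guide.)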
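A least-privilege review of a policy like the one above can be partly automated. The following is a minimal sketch of such a check, using only the Python standard library. The function name `find_policy_warnings` and its three rules (wildcard actions, an unconditioned "*" resource, and a wildcard in the region field of an ARN) are illustrative assumptions; they are not an AWS tool and do not replace IAM Access Analyzer or a manual review.

```python
def find_policy_warnings(policy: dict) -> list[str]:
    """Return readable warnings for overly broad Allow statements."""
    warnings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue

        # Action and Resource may each be a string or a list of strings.
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = stmt.get("Resource", [])
        resources = [resources] if isinstance(resources, str) else resources

        for action in actions:
            if "*" in action:
                warnings.append(f"statement {i}: wildcard action {action!r}")

        if "*" in resources and "Condition" not in stmt:
            warnings.append(f"statement {i}: Resource '*' without a Condition")

        for arn in resources:
            # ARN layout: arn:partition:service:region:account-id:resource
            parts = arn.split(":", 5)
            if arn.startswith("arn:") and len(parts) == 6 and "*" in parts[3]:
                warnings.append(f"statement {i}: wildcard region in {arn!r}")

    return warnings
```

Run against the sample policy (loaded with `json.loads` after substituting real region and account values), the check returns no warnings: every action is named explicitly, and the one "*" resource carries a tag condition. The check only inspects how a policy is written; it does not verify whether each action supports the resources or conditions attached to it.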
Connecting APM Tools
Next, the payments SRE team connects their APM tools to the payments Agent Space. AWS DevOps Agent integrates with each provider through the provider's remote MCP (Model Context Protocol) server, so no custom integration code is required:
- Datadog (Refer to "Connecting DataDog" under "Connecting telemetry sources" in the AWS DevOps Agent User Guide): configured with the team's Datadog API credentials
- New Relic (Refer to "Connecting New Relic" under "Connecting telemetry sources" in the AWS DevOps Agent User Guide): configured with the team's New Relic API key
- Splunk (Refer to "Connecting Splunk" under "Connecting telemetry sources" in the AWS DevOps Agent User Guide): configured with the team's Splunk credentials
For organizations using Dynatrace (Refer to "Connecting Dynatrace" under "Connecting telemetry sources" in the AWS DevOps Agent User Guide), AWS DevOps Agent provides a richer two-way integration via an AWS-hosted MCP server, including topology mapping and status updates published back to the Dynatrace UI. (Refer to the "Resolve application issues autonomously with AWS DevOps Agent (Preview) and Dynatrace" blog post.)
All APM connections are configured within the Agent Space's Capabilities settings. (Refer to "Connecting telemetry sources" in the AWS DevOps Agent User Guide.) The agent queries each tool's MCP server on demand during investigations, so teams do not need to build custom integration pipelines, deploy intermediary data aggregation layers, or maintain ETL processes to normalize telemetry across tools.
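The on-demand query model can be pictured as a concurrent fan-out: each provider is asked about the same time window at once, and the answers are collected side by side. The following is a minimal sketch of that pattern using Python's standard `concurrent.futures` module. The three `query_*` functions are hypothetical stubs that return canned data; they stand in for calls to each provider's MCP server and do not reflect the agent's real client code or the MCP wire protocol.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


# Hypothetical stubs: a real implementation would call each provider's MCP server.
def query_datadog(window_minutes: int) -> dict:
    return {"layer": "infrastructure", "status": "healthy"}


def query_new_relic(window_minutes: int) -> dict:
    return {"layer": "application", "status": "db_timeouts"}


def query_splunk(window_minutes: int) -> dict:
    return {"layer": "security", "status": "no_events"}


PROVIDERS = {
    "datadog": query_datadog,
    "new_relic": query_new_relic,
    "splunk": query_splunk,
}


def collect_signals(window_minutes: int, timeout_s: float = 10.0) -> dict:
    """Query every provider concurrently and gather results by provider name."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(PROVIDERS)) as pool:
        futures = {
            pool.submit(fn, window_minutes): name for name, fn in PROVIDERS.items()
        }
        for future in as_completed(futures, timeout=timeout_s):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # one failing provider should not hide the others
                results[name] = {"layer": "unknown", "status": f"error: {exc}"}
    return results


signals = collect_signals(window_minutes=15)
```

Because the queries run in parallel, the total wait is roughly that of the slowest provider, not the sum of all three. In the manual triage described earlier, the engineer checked the same three dashboards one after another.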
Connecting Incident Response Tools
The team also connects their incident response workflow tools:
- PagerDuty (Refer to "Configuring capabilities for AWS DevOps Agent" in the AWS DevOps Agent User Guide): configured to automatically trigger AWS DevOps Agent investigations via webhooks (Refer to "Starting Investigations" under "DevOps Agent Incident Response" in the AWS DevOps Agent User Guide) when alerts fire
- Slack (Refer to "Connecting Slack" under "Connecting to ticketing and chat" in the AWS DevOps Agent User Guide): configured to receive investigation findings, root cause analyses, and mitigation plans in the team's incident channel
Topology and Readiness
Once all connections are in place, the agent automatically builds a topology graph (Refer to "What is a DevOps Agent Topology" in the AWS DevOps Agent User Guide), mapping resource relationships across their EKS clusters, databases, and services. This topology enables the agent to understand dependencies during investigations: not just individual metrics, but how components relate to each other.
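To illustrate why a dependency map helps during an investigation, the following is a minimal sketch that represents a topology as an adjacency dictionary and walks it breadth-first from a failing component. The service names and the `dependencies_of` helper are hypothetical; the agent's actual topology format and traversal logic are not described in this article.

```python
from collections import deque

# Hypothetical topology: each key depends on the services in its list.
TOPOLOGY = {
    "payment-gateway": ["payment-service"],
    "payment-service": ["payments-db", "fraud-check"],
    "fraud-check": ["payments-db"],
    "payments-db": [],
}


def dependencies_of(graph: dict[str, list[str]], start: str) -> list[str]:
    """Return everything `start` depends on, nearest dependencies first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:  # visit shared dependencies only once
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


suspects = dependencies_of(TOPOLOGY, "payment-gateway")
# suspects == ["payment-service", "payments-db", "fraud-check"]
```

Starting from the component named in the alert, the traversal yields an ordered list of components worth checking. That is how a gateway-level symptom can lead to a database-level cause.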
The team also verifies that all data is encrypted in transit using TLS and at rest with AWS-managed encryption, and that investigation findings are logged with complete audit trails in the agent journal—providing the compliance record required for their regulated environment.
The payments Agent Space is now ready for production.
The Same Incident, Resolved
Two weeks later, at 2:47 AM, the same pattern occurs: intermittent payment gateway failures trigger a PagerDuty alert. This time, the response is different.
The PagerDuty alert automatically triggers an AWS DevOps Agent investigation in the payments Agent Space. Within minutes, the agent:
- Correlates multi-APM signals: The agent queries Datadog (infrastructure healthy), New Relic (application database timeouts), and Splunk (no security events) through each provider's MCP server simultaneously. These are the same three tools the engineer manually checked one by one in the previous incident.
- Resolves conflicting signals: The agent recognizes the same contradiction the engineer faced: infrastructure metrics report healthy status while application traces show timeout errors. It queries application-layer telemetry (traces, error rates, transaction latencies) from New Relic alongside infrastructure-layer metrics (CPU, memory, network) from Datadog. When these signals conflict, the agent weighs application-layer evidence (transaction-level error traces and connection pool metrics) more heavily than aggregate infrastructure health checks, because infrastructure averages can mask application-specific bottlenecks.
- Identifies root cause: The agent detects database connection pool exhaustion, the same root cause as before, but arrives at it through correlated metrics across all three APM tools rather than manual dashboard switching.
- Provides actionable recommendation: The agent suggests scaling the connection pool with specific configuration changes.
- Maintains compliance audit: The entire investigation is logged to the immutable agent journal with complete audit trails, meeting the organization's regulatory requirements.
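The weighting idea behind the conflict-resolution step can be illustrated with a toy scoring function. The following is a minimal sketch only: the `Signal` class, the layer weights, and the `assess` function are hypothetical and chosen for illustration. The agent's actual reasoning is not documented in this article and is unlikely to reduce to fixed weights.

```python
from dataclasses import dataclass

# Hypothetical weights: application-layer evidence counts more than
# aggregate infrastructure or security health checks.
LAYER_WEIGHTS = {"application": 3.0, "infrastructure": 1.0, "security": 1.0}


@dataclass
class Signal:
    source: str
    layer: str
    healthy: bool
    detail: str


def weight(signal: Signal) -> float:
    return LAYER_WEIGHTS.get(signal.layer, 1.0)


def assess(signals: list[Signal]) -> tuple[str, list[str]]:
    """Return an overall verdict plus unhealthy findings, heaviest layer first."""
    unhealthy = [s for s in signals if not s.healthy]
    unhealthy_score = sum(weight(s) for s in unhealthy)
    healthy_score = sum(weight(s) for s in signals if s.healthy)
    verdict = "degraded" if unhealthy and unhealthy_score >= healthy_score else "healthy"
    leads = [s.detail for s in sorted(unhealthy, key=weight, reverse=True)]
    return verdict, leads


incident = [
    Signal("Datadog", "infrastructure", True, "CPU, memory, network within thresholds"),
    Signal("New Relic", "application", False, "database timeouts on payment service"),
    Signal("Splunk", "security", True, "no security events"),
]
verdict, leads = assess(incident)
# verdict == "degraded"; leads == ["database timeouts on payment service"]
```

A simple majority vote over these three signals would report the system as healthy, two to one. Weighting by layer lets the single application-layer failure outweigh the two healthy aggregate checks, which matches the reasoning described in the list above.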
The SRE team receives a Slack notification with complete root cause analysis and a recommended fix—significantly faster than the extended manual correlation that characterized their previous incidents. They implement the connection pool scaling, and the system recovers. The autonomous investigation dramatically reduces incident duration and the number of failed transactions compared to the previous manual correlation approach, where the prolonged investigation window allowed failures to accumulate.
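When reviewing a pool-size recommendation, a quick estimate based on Little's law (average connections in use equals request rate multiplied by average connection hold time) can serve as a sanity check. The following is a minimal sketch of that arithmetic. The function name, the 25% headroom default, and the traffic figures are hypothetical; this is not the calculation the agent performs.

```python
import math


def required_connections(
    requests_per_second: float,
    avg_hold_seconds: float,
    headroom: float = 0.25,
) -> int:
    """Estimate pool size via Little's law, plus a safety margin.

    Mean connections in use = arrival rate x mean time each request holds a connection.
    """
    if requests_per_second < 0 or avg_hold_seconds < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    in_use = requests_per_second * avg_hold_seconds
    return math.ceil(in_use * (1 + headroom))


# Hypothetical figures: 200 requests/s, each holding a connection for 0.25 s.
normal = required_connections(200, 0.25)   # 50 in use, 63 with headroom

# If slow queries double the hold time to 0.5 s, demand doubles as well.
degraded = required_connections(200, 0.5)  # 100 in use, 125 with headroom
```

The second case shows how a pool sized for normal latency can be exhausted without any increase in traffic: slower queries alone double the number of connections held at once. This is consistent with infrastructure metrics staying within thresholds while the application times out.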
Reference: Commonwealth Bank of Australia tested AWS DevOps Agent with a complex network and identity management issue—the type that "can take a seasoned DevOps engineer hours to identify"—and the agent found the root cause in under 15 minutes. (Source: AWS DevOps Agent Product Page)
Benefits and Business Impact
AWS DevOps Agent delivers measurable improvements for enterprise operations:
- Reduced MTTR: Autonomous investigation starts immediately when alerts fire, reducing the time from alert to root cause identification. In the Commonwealth Bank of Australia test cited above, the agent found the root cause of a complex network and identity management issue in under 15 minutes, a task that can take a seasoned engineer hours to diagnose manually.
- Improved Operational Efficiency: Teams shift from reactive firefighting to proactive operational management, freeing engineers from repetitive investigation tasks.
- Compliance Readiness: Automated audit trails via the immutable agent journal and data isolation across Agent Spaces support regulatory requirements.
- Works Within Existing Workflows: Integrates with existing APM tools and processes without disruption; no replacement is required.
Pattern Scope and Limitations
What This Pattern Enables: This pattern supports prompt-driven investigations using natural language, multi-APM correlation across Datadog, New Relic, Splunk, Dynatrace, and CloudWatch via MCP server integrations, and intelligent conflict resolution for contradictory monitoring signals. It also provides compliance-ready audit logging, data residency controls, and encryption, along with multi-LOB operations through isolated Agent Spaces with three-level isolation — AWS account, user access, and data.
What This Pattern Does Not Provide: This pattern does not replace existing APM tools — it works alongside them. It does not perform automatic application-level remediation; it identifies issues but does not execute code changes. It does not migrate historical data, as its focus is on real-time and forward-looking observability. Organizations also retain responsibility for APM tool licensing, and while the pattern provides compliance-ready features, regulatory approval remains organization-specific and is not guaranteed.
Key Takeaway
AWS DevOps Agent transforms observability for enterprise teams by eliminating the traditional trade-off between comprehensive monitoring and operational efficiency. Through autonomous multi-APM correlation and intelligent conflict resolution, organizations achieve rapid incident response while maintaining compliance standards required in regulated industries.
In the next article in this series, we will explore how AWS DevOps Agent's proactive recommendations, custom MCP server integrations (Refer to "Connecting MCP Servers" under "Configuring capabilities for AWS DevOps Agent" in the AWS DevOps Agent User Guide), and detailed mitigation plans extend beyond incident response to prevent future incidents (Refer to "Preventing future incidents" in the AWS DevOps Agent User Guide) and integrate with your broader DevOps toolchain.
This article was co-authored by Krish Balaraman, Sr. Enterprise Support Manager, AWS Enterprise Support.
