Accelerating Incident Detection and Response onboarding with the Incident Detection and Response CLI

15 minute read
Content level: Expert

This article shows you how to use the AWS Incident Detection and Response Command Line Interface (CLI) to streamline workload registration, alarm creation, and Application Performance Monitoring (APM) integration.

Introduction

Many organizations want to avoid issues with critical production systems and have the quickest recovery when they do have an incident. Incident Detection and Response offers a powerful solution: When your most critical production systems go down, AWS Support contacts you within 5 minutes with full context of your environment, custom runbooks, and direct escalation paths to AWS service teams. Instead of raising a support case at 2 AM and having to escalate through your Technical Account Manager (TAM), the Incident Detection and Response team creates a bridge, pulls in the right engineers, and manages the incident end to end until your service recovers. To get these benefits, you must onboard your workloads to Incident Detection and Response and configure the workload to send the right signals. Until recently, this setup required you to coordinate multiple manual steps across Amazon CloudWatch, Amazon EventBridge, and AWS Support. The Incident Detection and Response CLI changes that. This CLI, invoked as awsidr, is a Python command line tool that automates the onboarding workflow from workload registration through alarm deployment to APM integration.

What happens when Incident Detection and Response engages

When an onboarded workload triggers an alarm, the Incident Detection and Response team's Incident Management Engineers (IMEs) evaluate the situation in real time. When an alarm momentarily triggers and recovers, such as when an unhealthy host count drops and immediately comes back because Auto Scaling replaced the instance, Incident Detection and Response sends a proactive notification:

"Critical Alarm ABC triggered and recovered. If further assistance is required, please respond to this case"

If you respond to this notification that you still need assistance, then Incident Detection and Response contacts you within 5 minutes.

During an active incident, Incident Detection and Response starts an incident bridge and loops in the required support engineers and AWS service teams. The team then works through the pre-approved runbooks to investigate and resolve the issue. You can respond to notifications and join AWS Support's incident bridge, or AWS Support can join your team's bridge. When the incident resolves, Incident Detection and Response provides a Post-Incident Report (PIR) with a root cause analysis for AWS service incidents.

The value of this process is what you don't have to do. Incident Detection and Response manages AWS coordination, brings in the right service specialists, and accelerates root cause analysis by correlating internal AWS signals with your workload telemetry.

| Phase | What Incident Detection and Response does | What you do |
|---|---|---|
| Detection | CloudWatch alarm triggers; Incident Detection and Response monitoring system receives the event | No action required |
| Engagement | Opens a support case, starts a bridge, and notifies your contacts within 5 minutes | Acknowledge notification; join incident bridge |
| Investigation | Engages AWS service teams; follows runbooks and coordinates joint troubleshooting for AWS issues | Application troubleshooting; approve and execute remediation steps |
| Resolution | Closes incident; generates PIR | Review PIR; implement recommendations |

Solution overview

The Incident Detection and Response CLI is a Python-based command line tool that automates the entire Incident Detection and Response onboarding workflow. Instead of completing questionnaires and coordinating with multiple teams, you run four core commands that handle workload registration, alarm management, and APM integration deployment.

The solution consists of four primary commands:

  • register-workload: Uses tags to automatically discover AWS resources and creates support cases with all the required metadata.

  • create-alarms: Generates CloudWatch alarms based on AWS best practice thresholds for each discovered resource type.

  • ingest-alarms: Ingests existing alarms that you previously deployed, including validation for noisy alarm patterns.

  • setup-apm: Deploys EventBridge and webhook infrastructure for third-party APM tools.

For the solution, the Incident Detection and Response CLI operates in two modes.

Interactive mode
The interactive mode acts as a step-by-step guided wizard with real-time validation. The CLI validates inputs as you work and checks that tags return resources, alarm thresholds are sensible, and Amazon Resource Names (ARNs) are valid. This mode is ideal for first-time setup or complex configurations where you want to review and select specific resources.

Unattended mode
The unattended mode is a file-based configuration (JSON) for batch processing. After you understand the interactive mode, you can use the unattended mode to create JSON configuration files and onboard multiple workloads simultaneously. This mode also supports dry-run validation so that you can verify inputs before you create any support cases or modify resources. For sample JSON files, see Unattended mode on the GitHub website.

Note: To keep your workload aligned as your infrastructure changes, use the unattended mode to integrate into CI/CD pipelines. Add awsidr update-workload --config workload-config.json to your deployment pipeline. The CLI rediscovers resources by tag and automatically updates the workload with each deployment.
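
As a rough sketch of the pipeline step that the preceding note describes, the following Python wrapper builds the awsidr update-workload invocation and runs it only on request. The function names and the execute flag are hypothetical and exist only for illustration. The command and its --config option are the ones shown in the note.

```python
import subprocess
from typing import List


def build_update_command(config_path: str) -> List[str]:
    """Return the argv for the update-workload command shown in the note."""
    return ["awsidr", "update-workload", "--config", config_path]


def run_update(config_path: str, execute: bool = False) -> List[str]:
    """Build the command and, if execute is True, run it.

    The execute flag is a hypothetical convenience so that the command can be
    reviewed in pipeline logs before it is allowed to run.
    """
    argv = build_update_command(config_path)
    if execute:
        # check=True raises CalledProcessError on a non-zero exit code,
        # which fails the pipeline step in most CI/CD runners.
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    print(" ".join(run_update("workload-config.json")))
```

If you pass the arguments as a list instead of a shell string, then the script doesn't need to handle quoting when the configuration path contains spaces.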

Prerequisites

  • An active AWS Unified Operations or AWS Enterprise Support subscription with Incident Detection and Response entitlement.

  • Python 3.8 or later installed on your workstation.

  • AWS Command Line Interface (AWS CLI) configured with valid credentials for the target account.

  • AWS Identity and Access Management (IAM) permissions for CloudWatch, Support API, AWS CloudFormation, and EventBridge. For more information, see IAM policies for IDR customer CLI on the GitHub website.

  • Contact information for the on-call team that Incident Detection and Response engages with during incidents.

Solution implementation

To implement this solution, complete the following tasks:

  1. Install the Incident Detection and Response CLI.

  2. Register your workload with tag-based resource discovery.

  3. (Optional) Create CloudWatch alarms based on AWS best practices.

  4. Ingest existing alarms.

  5. (Optional) Set up APM integration.

  6. Test your implementation.

Installing the Incident Detection and Response CLI

To install the Incident Detection and Response CLI for isolated package management, run the following command in a Python virtual environment:

pip install awsidr

To check the CLI installation, run the following command:

awsidr --version

Note: To avoid conflicts with other Python packages on your system, it’s a best practice to use a virtual environment to run the preceding commands.
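
Before you install, you can check the two conditions that this section relies on: a supported Python version and an active virtual environment. The following is a minimal sketch that uses only the standard library. The function names are hypothetical and aren't part of the Incident Detection and Response CLI.

```python
import sys

MIN_VERSION = (3, 8)  # Minimum Python version listed in the prerequisites


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_VERSION


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv.

    Inside a virtual environment, sys.prefix points at the environment and
    sys.base_prefix points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python version supported:", python_is_supported())
    print("Running in a virtual environment:", in_virtual_environment())
```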

After you install the Incident Detection and Response CLI, the CLI displays four commands:

  • register-workload

  • create-alarms

  • ingest-alarms

  • setup-apm

Each command handles a specific part of the onboarding workflow.

Registering your workload

Workload registration often takes the most time in a traditional Incident Detection and Response onboarding workflow. This step requires you to respond to questionnaires, provide resource and contact information, follow escalation paths, and provide architectural details. The CLI replaces this step with an interactive guided experience. To start this step in the Incident Detection and Response CLI, run the following command:

awsidr register-workload

The interactive wizard guides you through the following steps:

  1. Enter workload name: Provide a descriptive name, such as ProductionWebApp.

  2. Select Region: Select the AWS Region where your resources are deployed, such as us-east-1.

  3. Choose discovery method: Select tag-based discovery (recommended) or manual ARN selection.

  4. Enter tags: Provide tags to identify resources, such as Environment=Production,Application=WebApp.

  5. Review discovered resources: The CLI displays all resources that match your tags: Amazon Elastic Compute Cloud (Amazon EC2) instances, load balancers, Amazon Relational Database Service (Amazon RDS) databases, security groups, and more.

  6. Select resources: Select the resources that you want to include in your workload, or accept all discovered resources.

  7. Enter contact information: Provide primary and escalation contacts, such as names, email addresses, and phone numbers.
    Note: It’s a best practice to provide phone numbers so that the Incident Detection and Response team can contact you during off-hours incidents.

  8. Confirm and submit: Review the configuration and create the support case.
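
To illustrate the tag format in step 4, the following sketch parses a Key=Value,Key=Value string and converts it to the Key and Values filter shape that the AWS Resource Groups Tagging API accepts. This article doesn't describe how the CLI parses tags internally, so treat the function names and the validation rules as assumptions.

```python
from typing import Dict, List


def parse_tags(tag_string: str) -> Dict[str, str]:
    """Parse a string such as 'Environment=Production,Application=WebApp'."""
    tags: Dict[str, str] = {}
    for pair in tag_string.split(","):
        pair = pair.strip()
        if not pair:
            continue  # Ignore empty segments such as a trailing comma
        key, separator, value = pair.partition("=")
        if not separator or not key.strip():
            raise ValueError(f"Expected Key=Value, got {pair!r}")
        tags[key.strip()] = value.strip()
    return tags


def to_tag_filters(tags: Dict[str, str]) -> List[dict]:
    """Convert parsed tags to the Key/Values filter shape used by get-resources."""
    return [{"Key": key, "Values": [value]} for key, value in tags.items()]


if __name__ == "__main__":
    parsed = parse_tags("Environment=Production,Application=WebApp")
    print(to_tag_filters(parsed))
```

When you validate the string before discovery, you catch a missing equal sign early instead of getting an empty resource list.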

Figure 1: An example resource discovered.

Figure 2: An example onboarding case submission summary.

The CLI creates a support case in your AWS account and attaches a JSON file that contains the following workload details:

  • Resource ARNs

  • Contacts

  • Region

  • Metadata

The IMEs that support you use this case as the single source of truth for your workload. If the case is open, then every subsequent CLI command in the process updates this same case. If the case is closed, then IMEs create a new case to provide the Incident Detection and Response team full context and traceability. Because the cases are in your account, you also have full visibility throughout the process.

For batch processing across multiple workloads, run the following command to use unattended mode with a configuration file:

awsidr register-workload --config workload-config.json

Creating CloudWatch alarms

The create-alarms command generates CloudWatch alarms directly in the CLI based on AWS best practice guidance for each resource type.

To create alarms in interactive mode, run the following command:

awsidr create-alarms

The CLI automatically completes the following tasks:

  • Identifies resource types in your registered workload.

  • Recommends appropriate alarms based on AWS best practices.

  • Provides alarms for you to review and deploy.

  • Creates alarms with proper naming conventions.

  • Defines escalation runbook contacts for when an incident occurs.

  • Attaches alarms to your workload.

  • Updates the support case with alarm details.

Important: You must tune the default alarm recommendations to meet your workload requirements. For more information, see IDR alarm recommendations on the GitHub website.

During the first week, it’s a best practice to monitor alarm behavior and adjust thresholds to reduce false positives.
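
To make the tuning advice concrete, the following sketch builds the parameters for one alarm on the Application Load Balancer UnHealthyHostCount metric. The parameter names are the ones that the CloudWatch PutMetricAlarm API accepts. The threshold, periods, and alarm naming pattern are illustrative assumptions. They aren't the CLI's defaults or its naming convention.

```python
from typing import Any, Dict


def unhealthy_host_alarm(
    workload: str,
    load_balancer: str,
    target_group: str,
    threshold: float = 1.0,
    evaluation_periods: int = 5,
    datapoints_to_alarm: int = 3,
) -> Dict[str, Any]:
    """Return PutMetricAlarm parameters for an unhealthy host alarm.

    All default values are placeholders that you tune for your workload.
    """
    if datapoints_to_alarm > evaluation_periods:
        raise ValueError("datapoints_to_alarm cannot exceed evaluation_periods")
    return {
        "AlarmName": f"{workload}-UnHealthyHostCount",  # Hypothetical naming pattern
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer},
            {"Name": "TargetGroup", "Value": target_group},
        ],
        "Statistic": "Maximum",
        "Period": 60,  # Seconds per datapoint
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": datapoints_to_alarm,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }
```

If you require several breaching datapoints within the evaluation window, then the alarm is less likely to trigger when Auto Scaling replaces a single instance.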

Ingesting existing alarms into Incident Detection and Response

To include existing alarms in your Incident Detection and Response configurations, run the ingest-alarms command. This command registers existing alarms with the Incident Detection and Response monitoring system without recreating the alarms.

The CLI provides two discovery methods:

  • Tag-based discovery: Automatically finds alarms with specific tags. This method is a best practice for large environments.

  • Manual selection: Browse and select alarms from a provided list.

To ingest alarms in interactive mode, run the following command:

awsidr ingest-alarms

The interactive wizard guides you through the following steps:

  1. Select your workload.

  2. Choose CloudWatch alarms or APM alarms.
    Note: To use APM alarms, you must first run awsidr setup-apm.

  3. Choose tag-based discovery or manual selection.

  4. Enter tags, such as Environment=Production. Or, browse the alarm list and select your alarms.

  5. Review discovered alarms and select alarms for Incident Detection and Response to ingest.

  6. The CLI validates each alarm for noisy patterns, such as frequent state changes, constant ALARM state, and flapping, and then flags any concerns.

  7. Review your submission. The CLI creates or updates the support case with the ingested alarm details, as seen in Figure 3.

Figure 3: Example output of alarm details.
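
The noisy pattern check in step 6 can be pictured with a simple heuristic over an alarm's state change history. The following sketch is one possible approach and isn't the CLI's validation logic. The 7-day window, the limit of 10 transitions, and the label names are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (timestamp of the change, new state such as "OK" or "ALARM")
StateChange = Tuple[datetime, str]


def classify_alarm(
    history: List[StateChange],
    current_state: str,
    now: datetime,
    window: timedelta = timedelta(days=7),
    max_transitions: int = 10,
) -> str:
    """Label an alarm as 'flapping', 'constant-alarm', or 'ok'."""
    # Keep only the state changes that fall inside the review window.
    recent = [(ts, state) for ts, state in history if now - ts <= window]
    transitions_to_alarm = sum(1 for _, state in recent if state == "ALARM")
    if transitions_to_alarm > max_transitions:
        return "flapping"
    if current_state == "ALARM" and not recent:
        # No state change inside the window means that the alarm
        # stayed in ALARM for the whole window.
        return "constant-alarm"
    return "ok"
```

An alarm that this kind of check flags is usually a candidate for a longer evaluation window or a different threshold before you ingest it.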

Integrating third-party APM tools

The setup-apm command deploys the infrastructure that you need to route alerts from third-party APM tools into Incident Detection and Response.

The Incident Detection and Response CLI supports three integration patterns, as seen in Figure 4:

Figure 4: Three APM integration patterns that Incident Detection and Response CLI supports.

Pattern 1: EventBridge SaaS partner integration
This pattern is best for APM tools that support EventBridge partner event sources, such as Datadog, New Relic, and Splunk Observability Cloud. The CLI deploys an AWS Lambda transformer that connects the partner event bus to the Incident Detection and Response custom event bus.

Pattern 2: Amazon SNS integration
This pattern is best for tools that can publish to Amazon Simple Notification Service (Amazon SNS), such as Grafana Cloud and Prometheus. The CLI deploys a Lambda function that picks up the Amazon SNS messages, transforms them, and then forwards them to the Incident Detection and Response event bus.

Pattern 3: Webhook integration
This pattern is best for tools that can publish to an HTTPS endpoint, such as Dynatrace, SumoLogic, and other APM tools. The CLI deploys the full stack: Amazon API Gateway endpoint, token-based authentication through AWS Secrets Manager, Lambda authenticator and transformer, and the custom EventBridge event bus.

To deploy APM integration infrastructure, run the following command:

awsidr setup-apm

The interactive wizard guides you through the following steps:

  1. Select deployment Region: Select where to deploy the infrastructure, such as us-east-1.

  2. Choose APM provider/pattern: Select from supported providers or patterns.

  3. Review custom field mappings: The CLI uses default field mappings, such as problemId, severity, and problemTitle for Dynatrace. If your APM uses different field names, then you can customize the fields here.

  4. Review deployment plan: The CLI shows all resources that it will create.

  5. Confirm deployment: A CloudFormation stack automatically deploys.
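
The field mapping in step 3 can be thought of as a lookup from the names that the transformer expects to the names that your APM tool sends. The following sketch shows that idea. The source field names problemId, severity, and problemTitle are the Dynatrace defaults named in step 3. The target names id, severity, and title and the function name are hypothetical, and the deployed Lambda transformer might work differently.

```python
from typing import Any, Dict

# Target field name (hypothetical) -> field name in the APM payload
DEFAULT_MAPPING: Dict[str, str] = {
    "id": "problemId",
    "severity": "severity",
    "title": "problemTitle",
}


def apply_mapping(payload: Dict[str, Any], mapping: Dict[str, str] = DEFAULT_MAPPING) -> Dict[str, Any]:
    """Return a normalized event that is built from the mapped payload fields."""
    missing = [source for source in mapping.values() if source not in payload]
    if missing:
        raise KeyError(f"Payload is missing mapped fields: {missing}")
    return {target: payload[source] for target, source in mapping.items()}


if __name__ == "__main__":
    sample = {"problemId": "P-123", "severity": "AVAILABILITY", "problemTitle": "Host down"}
    print(apply_mapping(sample))
```

If your tool sends alertId instead of problemId, then you change only the mapping entry. The rest of the transformation stays the same.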

Testing your implementation

Before you rely on Incident Detection and Response for production incident response, use a game day exercise to validate your configuration. This exercise confirms that alarms trigger correctly, that alerts route to Incident Detection and Response, and that your team's contact information is accurate.

For your game day exercise, follow these best practices:

  1. Verify alarm coverage: List all alarms associated with your workload.

  2. Run the CloudWatch set-alarm-state command to manually test the alarm and confirm that the Incident Detection and Response monitoring system receives the event.

  3. Confirm that your on-call contacts receive the initial Incident Detection and Response notification.

  4. Validate that webhook or APM integrations forward alerts.

  5. Review Incident Detection and Response response times and confirm that the times meet your requirements.
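
For step 2, the following sketch assembles the aws cloudwatch set-alarm-state command for one alarm so that you can review the command before you run it. The alarm name and the reason text are placeholders, and the helper name is hypothetical. The --alarm-name, --state-value, and --state-reason options are the options that the AWS CLI command requires.

```python
import shlex
from typing import List

VALID_STATES = {"OK", "ALARM", "INSUFFICIENT_DATA"}


def build_set_alarm_state(
    alarm_name: str,
    state: str = "ALARM",
    reason: str = "Game day test",
) -> List[str]:
    """Return the argv that forces an alarm into the given state."""
    if state not in VALID_STATES:
        raise ValueError(f"Unsupported alarm state: {state}")
    return [
        "aws", "cloudwatch", "set-alarm-state",
        "--alarm-name", alarm_name,
        "--state-value", state,
        "--state-reason", reason,
    ]


if __name__ == "__main__":
    # shlex.join quotes arguments that contain spaces for safe copy and paste.
    print(shlex.join(build_set_alarm_state("alarm-name-1")))
```

A state that you set manually is temporary because CloudWatch returns the alarm to its real state at the next evaluation. Coordinate the timing with your TAM so that the Incident Detection and Response team expects the test event.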

To schedule a game day exercise, contact your TAM.

Cleanup

You might need to remove Incident Detection and Response CLI resources when you reconfigure alarms, migrate APM tools, move between accounts, or clean up test environments. To remove resources that the Incident Detection and Response CLI deployed, run the following commands:

CloudWatch
To list the alarms created for your workload, run the following command:

aws cloudwatch describe-alarms --alarm-name-prefix "YourWorkloadName"

To delete specific alarms, run the following command:

aws cloudwatch delete-alarms --alarm-names "alarm-name-1" "alarm-name-2"
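
When a workload has many alarms, you can delete the alarms in batches. CloudWatch accepts up to 100 alarm names for each delete operation, so the following sketch splits a list of names into delete-alarms commands of that size. The helper names are hypothetical, and the sketch only builds the commands. It doesn't run them.

```python
from typing import Iterator, List


def chunked(names: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` alarm names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


def build_delete_commands(names: List[str], size: int = 100) -> List[List[str]]:
    """Return one delete-alarms argv for each batch of alarm names."""
    return [
        ["aws", "cloudwatch", "delete-alarms", "--alarm-names", *batch]
        for batch in chunked(names, size)
    ]


if __name__ == "__main__":
    for command in build_delete_commands(["alarm-name-1", "alarm-name-2"]):
        print(" ".join(command))
```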

APM integration infrastructure
To delete the CloudFormation stack that setup-apm deployed, run the following command:

aws cloudformation delete-stack --stack-name "Name_of_IDR_APM_stack_deployed" --region us-east-1

To verify that you deleted the resources, run the following command:

aws cloudformation describe-stacks --stack-name "Name_of_IDR_APM_stack_deployed" --region us-east-1

Note: Deleting CloudWatch alarms and APM infrastructure doesn’t automatically offboard the workload from Incident Detection and Response. To complete the offboarding process, submit a support case.

Troubleshooting

Permission denied errors

This error occurs when your IAM user or role lacks the required permissions to create CloudWatch alarms, register support cases, or deploy CloudFormation stacks. Or, this error can occur when AWS CLI credentials aren’t correctly configured.

To resolve this issue, complete the following tasks:

  • Verify that your IAM permissions include all required actions.

  • Confirm that your AWS CLI credentials are correctly configured.

Tag discovery returns zero resources

This error occurs when the tags that you specify don't exist on any resources in the selected Region, or the Region doesn't match where you deployed your workload. Or, the tag key-value pairs don't match.

To resolve this issue, complete the following tasks:

  • To verify that the tags exist on your resources, run the following command:

aws resourcegroupstaggingapi get-resources --tag-filters Key=Environment,Values=Production

  • Check that the Region selection matches the Region where you deployed your resources.

  • Make sure that your tag key and value pairs exactly match.
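
AWS tag keys and values are case sensitive, so a common cause of an empty result is a tag that differs only in case or in surrounding whitespace. The following sketch compares the tag that you expect with the tags on one resource and reports the kind of mismatch. The function name and the result labels are hypothetical.

```python
from typing import Dict


def _normalize(text: str) -> str:
    return text.strip().lower()


def diagnose_tag(expected_key: str, expected_value: str, resource_tags: Dict[str, str]) -> str:
    """Explain why an expected tag does or doesn't match a resource's tags."""
    if resource_tags.get(expected_key) == expected_value:
        return "match"
    if expected_key in resource_tags:
        actual_value = resource_tags[expected_key]
        if _normalize(actual_value) == _normalize(expected_value):
            return "value-case-or-whitespace"
        return "value-differs"
    for key in resource_tags:
        if _normalize(key) == _normalize(expected_key):
            return "key-case-or-whitespace"
    return "missing"


if __name__ == "__main__":
    print(diagnose_tag("Environment", "Production", {"environment": "Production"}))
```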

CloudFormation stack deployment fails

Stack deployment can fail when you reach service quota limits for EventBridge rules or API Gateway resources. Deployment can also fail when you encounter naming conflicts with existing resources or other constraints that are specific to CloudFormation.

To resolve this issue, complete the following tasks:

  • Review stack events in the AWS Management Console for specific error messages.

  • Verify service quotas for Lambda, API Gateway, and EventBridge in the target Region.

  • Make sure that there are no naming conflicts with existing CloudFormation stacks or resources.

APM webhook authentication fails

Authentication errors occur when the token stored in Secrets Manager doesn't match your APM tool configuration. Or, errors can occur when API Gateway can’t validate incoming webhook requests from your monitoring platform.

To resolve this issue, complete the following tasks:

  • Retrieve the authentication token from Secrets Manager and verify that the token matches the token you configured in your APM tool.

  • Check API Gateway logs for authentication errors.
    Note: You must turn on CloudWatch logging before you can monitor for errors.
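
When you compare the token from Secrets Manager with the token in your APM tool, a stray newline or space from a copy and paste is easy to miss. The following sketch separates an exact match from a difference that is only whitespace. This article doesn't describe the Lambda authenticator's logic, so this sketch is only a local troubleshooting aid with hypothetical names.

```python
import hmac


def compare_tokens(stored: str, presented: str) -> str:
    """Return 'match', 'whitespace-only difference', or 'mismatch'."""
    # hmac.compare_digest compares in constant time, which is the usual
    # choice when you compare secrets.
    if hmac.compare_digest(stored.encode("utf-8"), presented.encode("utf-8")):
        return "match"
    if hmac.compare_digest(stored.strip().encode("utf-8"), presented.strip().encode("utf-8")):
        return "whitespace-only difference"
    return "mismatch"


if __name__ == "__main__":
    # The token values here are placeholders, not real secrets.
    print(compare_tokens("example-token", "example-token\n"))
```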

Conclusion

The Incident Detection and Response CLI removes the manual overhead from onboarding critical workloads to Incident Detection and Response. By automating workload registration, alarm creation, alarm ingestion, and APM integration, the CLI reduces a multi-week manual process to a repeatable, scriptable workflow.

After you onboard your workload, Incident Detection and Response provides 24/7 proactive monitoring with a 5-minute engagement commitment. The service also makes sure that when something goes wrong in your production environment, AWS is already working on the problem before you need to raise a support case.

For questions, feedback, or to contribute to the project, see AWS Incident Detection and Response CLI on the GitHub website.

About the author

Yomesh Shah
Yomesh Shah is a Senior Solutions Architect at AWS. He brings over 25 years of experience helping enterprises maximize the value of their IT investments through optimization, automation, and process improvement. He currently helps AWS customers use scalable AWS Support solutions to enhance operational resilience and incident response capabilities. Yomesh holds a patent for the design of a Managed Services control plane in the cloud (US11856055B2).