How do I automatically trigger AWS DevOps Agent investigations from CloudWatch Alarms and EventBridge rules for ECS tasks?
I want to automatically invoke AWS DevOps Agent investigations when a CloudWatch Alarm enters ALARM state or when Amazon EventBridge detects an Amazon Elastic Container Service (Amazon ECS) task failure. This article walks through configuring both scenarios using AWS Lambda and HMAC-authenticated webhooks.
AWS DevOps Agent lets external systems trigger investigations automatically through webhooks. The most commonly requested integration between AWS services and DevOps Agent is with Amazon CloudWatch Alarms: whenever a CloudWatch Alarm enters the ALARM state, DevOps Agent should start an investigation and provide details about the alarm state. Similarly, DevOps Agent can be triggered from Amazon EventBridge rules for events such as CodePipeline status changes, Amazon ECS task status changes, or unauthorized API calls.
DevOps Agent supports two types of webhooks:
- Bearer token authentication: Used by services like Splunk, Datadog, New Relic, and ServiceNow.
- HMAC authentication (Generic): Used by systems that DevOps Agent does not cover natively but that can sign requests with HMAC, such as an AWS Lambda function.
This article covers two scenarios:
- CloudWatch Alarm based Investigation
- EventBridge based Investigation
Prerequisites: Configure DevOps Agent Webhook
Before configuring either scenario, create a DevOps Agent space and webhook:
Creating an Agent Space:
- Navigate to the AWS DevOps Agent console in the US East (N. Virginia) Region and create an Agent Space.
Configuring a Webhook in Agent Space:
- Once the space is created, click on "View Details" from the DevOps Agent console.
- In the Agent Space page, click on Capabilities and scroll down to the Webhooks section.
- Click Add and generate the webhook.
- The system generates an HMAC key pair. Securely store the generated key and secret.
- Save the Webhook URL along with the Secret Key. The Lambda function uses these values in both scenarios.
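The secret saved here is what signs each webhook request. As a rough sketch of the signing scheme that the Lambda functions later in this article use (HMAC-SHA256 over the string "timestamp:payload", base64-encoded, sent in the x-amzn-event-signature header), the helper names sign_payload and build_headers below are illustrative only and are not part of any AWS SDK:

```python
import base64
import hashlib
import hmac


def sign_payload(secret: str, event_timestamp: str, payload_json: str) -> str:
    """Return the base64-encoded HMAC-SHA256 of '<timestamp>:<payload>'."""
    message = f"{event_timestamp}:{payload_json}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), message, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")


def build_headers(secret: str, event_timestamp: str, payload_json: str) -> dict:
    """Build the request headers used by the Lambda code in this article."""
    return {
        "Content-Type": "application/json",
        "x-amzn-event-timestamp": event_timestamp,
        "x-amzn-event-signature": sign_payload(secret, event_timestamp, payload_json),
    }
```

The signature covers the exact JSON string that is sent as the request body, so the payload should be serialized once and the same string used for both signing and sending.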
1. CloudWatch Alarm based Investigation
A. Architecture Components :
- The CloudWatch Alarm will invoke the Lambda function when in an Alarm state.
- A Webhook is configured in the DevOps Agent → the Webhook URL and Secret Key are obtained.
- The Lambda function will trigger the DevOps Agent using the Webhook URL in the DevOps Agent Payload Format.
B. Steps :
Note: Complete the Prerequisites section before proceeding.
1. Lambda Function : The Lambda function is triggered by the CloudWatch Alarm. The Webhook URL and Secret are passed through Lambda function environment variables, and the payload follows the Generic Webhook Payload format described in [1].
Lambda Function to send the Payload :
- Create a Lambda function with a runtime of Python 3.12 and a timeout of at least 30 seconds.
- The Lambda function execution role needs the following permissions. Attach this IAM policy to the role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:DescribeAlarms",
        "cloudwatch:DescribeAlarmHistory"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/your-function-name:*"
    }
  ]
}
- Add the two environment variables listed below, using the values saved in the Prerequisites section (Webhook):
- WEBHOOK_SECRET
- WEBHOOK_URL
Note: Replace the WEBHOOK_URL and WEBHOOK_SECRET environment variable values with the values you obtained from the DevOps Agent webhook configuration in the prerequisites.
Security Note: This article uses Lambda environment variables for simplicity. For production workloads, store the webhook URL and secret in AWS Secrets Manager and retrieve them at runtime. This avoids exposing sensitive values in the Lambda console and provides automatic rotation, auditing, and fine-grained access control. For more information, see Using AWS Secrets Manager with AWS Lambda.
- Use the following code :
import json
import os
import hmac
import hashlib
import base64
import urllib3
from datetime import datetime

# Initialize HTTP client
http = urllib3.PoolManager()

# Get webhook configuration from environment variables
WEBHOOK_URL = os.environ.get('WEBHOOK_URL')
WEBHOOK_SECRET = os.environ.get('WEBHOOK_SECRET')

if not WEBHOOK_URL or not WEBHOOK_SECRET:
    raise ValueError("WEBHOOK_URL and WEBHOOK_SECRET environment variables must be set")


def lambda_handler(event, context):
    """
    Triggered by CloudWatch Alarm state change.
    Sends webhook to AWS DevOps Agent to start investigation.
    """
    print(f"Received event: {json.dumps(event)}")
    try:
        # Parse CloudWatch alarm event (direct Lambda invocation)
        alarm_name = event.get('alarmData', {}).get('alarmName', 'Unknown')
        alarm_description = event.get('alarmData', {}).get('configuration', {}).get('description') or ''
        new_state = event.get('alarmData', {}).get('state', {}).get('value', 'ALARM')
        reason = event.get('alarmData', {}).get('state', {}).get('reason', '')
        timestamp = event.get('alarmData', {}).get('state', {}).get('timestamp', datetime.utcnow().isoformat())
        region = event.get('region', 'us-east-1')
        account_id = event.get('accountId', '')

        # Only trigger investigation for ALARM state
        if new_state != 'ALARM':
            print(f"Alarm state is {new_state}, not triggering investigation")
            return {
                'statusCode': 200,
                'body': json.dumps('Alarm not in ALARM state, skipping')
            }

        # Extract metric information for better context
        metrics = event.get('alarmData', {}).get('configuration', {}).get('metrics', [])
        metric_info = ""
        if metrics:
            metric = metrics[0].get('metricStat', {}).get('metric', {})
            metric_name = metric.get('name', '')
            namespace = metric.get('namespace', '')
            dimensions = metric.get('dimensions', {})
            metric_info = f"\nMetric: {namespace}/{metric_name}"
            if dimensions:
                metric_info += f"\nDimensions: {dimensions}"

        # Build comprehensive description
        description = f"CloudWatch Alarm: {alarm_name}\n"
        description += f"AWS Account: {account_id}\n"
        description += f"Region: {region}\n"
        description += f"State: {new_state}\n"
        description += f"Reason: {reason}"
        if alarm_description:
            description += f"\nDescription: {alarm_description}"
        description += metric_info

        # Create webhook payload for AWS DevOps Agent
        payload = {
            "eventType": "incident",
            "incidentId": f"{alarm_name}-{timestamp}",
            "action": "created",
            "priority": "HIGH",
            "title": f"CloudWatch Alarm: {alarm_name}",
            "description": description,
            "timestamp": timestamp,
            "service": alarm_name,
            "data": {
                "metadata": {
                    "alarmName": alarm_name,
                    "region": region,
                    "accountId": account_id,
                    "newState": new_state,
                    "reason": reason,
                    "alarmArn": event.get('alarmArn', ''),
                    "metrics": metrics
                }
            }
        }

        # Convert payload to JSON string
        payload_json = json.dumps(payload)

        # Create timestamp for signature
        event_timestamp = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.000Z')

        # Generate HMAC signature (timestamp:payload format)
        signature_string = f"{event_timestamp}:{payload_json}"
        signature = hmac.new(
            WEBHOOK_SECRET.encode('utf-8'),
            signature_string.encode('utf-8'),
            hashlib.sha256
        ).digest()
        signature_b64 = base64.b64encode(signature).decode('utf-8')

        # Send webhook request
        headers = {
            'Content-Type': 'application/json',
            'x-amzn-event-timestamp': event_timestamp,
            'x-amzn-event-signature': signature_b64
        }
        response = http.request(
            'POST',
            WEBHOOK_URL,
            body=payload_json,
            headers=headers
        )
        print(f"Webhook response status: {response.status}")
        print(f"Webhook response body: {response.data.decode('utf-8')}")

        if response.status == 200 or response.status == 202:
            return {
                'statusCode': 200,
                'body': json.dumps('Investigation triggered successfully')
            }
        else:
            raise Exception(f"Webhook failed with status {response.status}")
    except Exception as e:
        print(f"Error: {str(e)}")
        raise e
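The Security Note above recommends AWS Secrets Manager for production workloads. The following is a minimal sketch of that approach, assuming the secret is stored as a JSON object with the keys WEBHOOK_URL and WEBHOOK_SECRET. That key layout, the helper name load_webhook_config, and the secret name in the usage comment are assumptions for illustration, not something the DevOps Agent documentation prescribes.

```python
import json


def load_webhook_config(secrets_client, secret_id):
    """Fetch the webhook URL and secret from a JSON secret.

    Assumption: the secret value is a JSON object with the keys
    'WEBHOOK_URL' and 'WEBHOOK_SECRET' (a layout chosen for this sketch).
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    config = json.loads(response["SecretString"])
    missing = [key for key in ("WEBHOOK_URL", "WEBHOOK_SECRET") if not config.get(key)]
    if missing:
        raise ValueError(f"Secret {secret_id} is missing keys: {missing}")
    return config["WEBHOOK_URL"], config["WEBHOOK_SECRET"]


# Usage inside the Lambda function (outside the handler, so the values are
# fetched once per execution environment rather than on every invocation):
#
#   import boto3
#   WEBHOOK_URL, WEBHOOK_SECRET = load_webhook_config(
#       boto3.client("secretsmanager"), "devops-agent/webhook"  # hypothetical secret name
#   )
```

With this approach, the execution role also needs secretsmanager:GetSecretValue on the secret, and the two environment variables are no longer required.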
2. CloudWatch Alarm : The alarm can monitor a metric for any resource type. In this example, an alarm monitors CPU utilization for an EC2 instance and enters the ALARM state when the configured threshold is breached.
CloudWatch Alarm :
- Create a CloudWatch Alarm from the CloudWatch Console. In the Specify Metric and Conditions step, fill in the required details related to the metric, for example the CPUUtilization metric of an EC2 instance.
- For Step 2, under Configure Actions, scroll down to Lambda Action.
- Select "In Alarm" for Alarm state trigger and choose the Lambda function that was created in the Lambda Function section.
- Scroll down and select Next.
- For Step 3, provide the Alarm name along with a Description.
- Click Create Alarm.
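The console steps above can also be scripted. The following sketch builds the arguments for the boto3 CloudWatch put_metric_alarm call, with the Lambda function ARN as the alarm action. The helper name build_alarm_request, the 80 percent threshold, and the 5-minute period are illustrative assumptions, not values from this article.

```python
def build_alarm_request(alarm_name, instance_id, function_arn, threshold=80.0):
    """Build keyword arguments for cloudwatch.put_metric_alarm().

    Threshold, period, and evaluation settings are example values only.
    """
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": "High CPU utilization; triggers a DevOps Agent investigation",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Only the ALARM transition invokes the function.
        "AlarmActions": [function_arn],
    }


# Usage (requires boto3 and AWS credentials):
#
#   import boto3
#   request = build_alarm_request(
#       "demo-high-cpu",
#       "i-0123456789abcdef0",
#       "arn:aws:lambda:us-east-1:123456789012:function:your-function-name",
#   )
#   boto3.client("cloudwatch").put_metric_alarm(**request)
```

When the alarm is created this way, the function's resource-based policy must allow CloudWatch alarms to invoke it. To the best of my knowledge the service principal for this is lambda.alarms.cloudwatch.amazonaws.com, but verify it against the current CloudWatch documentation before relying on it.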
Output :
Whenever the CloudWatch Alarm enters an Alarm state, a DevOps Agent investigation is triggered in the Agent Space.
The DevOps Agent will then review the alarm details and provide the necessary information related to the alarm state and the reason it was triggered.
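To exercise the function before a real alarm fires, a test event can be sent from the Lambda console. The following is a sketch of such an event, limited to the fields that the handler in this article reads. All values are placeholders, a real alarm event carries additional fields, and the helpers should_investigate and with_state are illustrative only.

```python
import json

# Placeholder test event; only the fields read by the handler are included.
SAMPLE_ALARM_EVENT = {
    "alarmArn": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:demo-high-cpu",
    "accountId": "123456789012",
    "region": "us-east-1",
    "alarmData": {
        "alarmName": "demo-high-cpu",
        "state": {
            "value": "ALARM",
            "reason": "Threshold crossed (sample reason text)",
            "timestamp": "2025-01-01T00:00:00.000+0000",
        },
        "configuration": {
            "description": "CPU above threshold",
            "metrics": [
                {
                    "metricStat": {
                        "metric": {
                            "namespace": "AWS/EC2",
                            "name": "CPUUtilization",
                            "dimensions": {"InstanceId": "i-0123456789abcdef0"},
                        }
                    }
                }
            ],
        },
    },
}


def should_investigate(event):
    """Mirror the handler's check: only the ALARM state triggers a webhook."""
    return event.get("alarmData", {}).get("state", {}).get("value") == "ALARM"


def with_state(event, value):
    """Return a deep copy of the event with a different alarm state."""
    copy = json.loads(json.dumps(event))
    copy["alarmData"]["state"]["value"] = value
    return copy
```

Pasting json.dumps(SAMPLE_ALARM_EVENT) into a Lambda test event should follow the ALARM path, while with_state(SAMPLE_ALARM_EVENT, "OK") should follow the skip path without calling the webhook.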
2. EventBridge based Investigation
A. Architecture Components :
- The ECS Cluster will be monitored through an EventBridge Rule.
- The EventBridge Rule will invoke the Lambda function when any task within the cluster is marked with a "STOPPED" state.
- A Webhook is configured in the DevOps Agent → the Webhook URL and Secret Key are obtained.
- The Lambda function will trigger the DevOps Agent using the Webhook URL in the DevOps Agent Payload Format.
B. Steps :
Prerequisites: Ensure the ECS Cluster has been created with tasks running, and complete the Prerequisites section above.
1. Lambda Function : The Lambda function is triggered by the EventBridge Rule. The Webhook URL and Secret are passed through Lambda function environment variables, and the payload follows the Generic Webhook Payload format described in [1].
Lambda Function to send the Payload :
- Create a Lambda function with a runtime of Python 3.12 and a timeout of at least 30 seconds.
- The Lambda function execution role needs basic execution permissions for CloudWatch Logs. Attach this IAM policy to the role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/your-function-name:*"
    }
  ]
}
- Add the two environment variables listed below, using the values saved in the Prerequisites section (Webhook):
- WEBHOOK_SECRET
- WEBHOOK_URL
Note: Replace the WEBHOOK_URL and WEBHOOK_SECRET environment variable values with the values you obtained from the DevOps Agent webhook configuration in the prerequisites.
Security Note: This article uses Lambda environment variables for simplicity. For production workloads, store the webhook URL and secret in AWS Secrets Manager and retrieve them at runtime. This avoids exposing sensitive values in the Lambda console and provides automatic rotation, auditing, and fine-grained access control. For more information, see Using AWS Secrets Manager with AWS Lambda.
- Use the following code :
import json
import os
import hmac
import hashlib
import base64
import urllib3
from datetime import datetime

http = urllib3.PoolManager()

WEBHOOK_URL = os.environ.get('WEBHOOK_URL')
WEBHOOK_SECRET = os.environ.get('WEBHOOK_SECRET')

if not WEBHOOK_URL or not WEBHOOK_SECRET:
    raise ValueError("WEBHOOK_URL and WEBHOOK_SECRET environment variables must be set")


def lambda_handler(event, context):
    """
    Triggered by EventBridge rule for ECS task state changes.
    Sends webhook to AWS DevOps Agent to investigate task failures.
    """
    print(f"Received event: {json.dumps(event)}")
    try:
        detail = event.get('detail', {})

        # Extract ECS task information
        task_arn = detail.get('taskArn', 'Unknown')
        cluster_arn = detail.get('clusterArn', 'Unknown')
        last_status = detail.get('lastStatus', '')
        desired_status = detail.get('desiredStatus', '')
        stopped_reason = detail.get('stoppedReason', '')

        # Extract container information
        containers = detail.get('containers', [])
        container_info = ""
        for container in containers:
            container_info += f"\n - {container.get('name', 'Unknown')}: {container.get('lastStatus', 'Unknown')}"
            if container.get('reason'):
                container_info += f" (Reason: {container.get('reason')})"

        # Build description
        description = f"ECS Task Failed\n"
        description += f"Task: {task_arn}\n"
        description += f"Cluster: {cluster_arn}\n"
        description += f"Status: {last_status}\n"
        description += f"Reason: {stopped_reason}"
        if container_info:
            description += f"\nContainers:{container_info}"

        # Create webhook payload
        payload = {
            "eventType": "incident",
            "incidentId": f"{task_arn.split('/')[-1]}-{datetime.utcnow().isoformat()}",
            "action": "created",
            "priority": "HIGH",
            "title": f"ECS Task Failed: {task_arn.split('/')[-1]}",
            "description": description,
            "timestamp": datetime.utcnow().isoformat(),
            "service": cluster_arn.split('/')[-1],
            "data": {
                "metadata": {
                    "taskArn": task_arn,
                    "clusterArn": cluster_arn,
                    "lastStatus": last_status,
                    "desiredStatus": desired_status,
                    "stoppedReason": stopped_reason,
                    "containers": containers
                }
            }
        }
        payload_json = json.dumps(payload)
        event_timestamp = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.000Z')

        # Generate HMAC signature
        signature_string = f"{event_timestamp}:{payload_json}"
        signature = hmac.new(
            WEBHOOK_SECRET.encode('utf-8'),
            signature_string.encode('utf-8'),
            hashlib.sha256
        ).digest()
        signature_b64 = base64.b64encode(signature).decode('utf-8')

        # Send webhook
        headers = {
            'Content-Type': 'application/json',
            'x-amzn-event-timestamp': event_timestamp,
            'x-amzn-event-signature': signature_b64
        }
        response = http.request(
            'POST',
            WEBHOOK_URL,
            body=payload_json,
            headers=headers
        )
        print(f"Webhook response status: {response.status}")
        print(f"Webhook response body: {response.data.decode('utf-8')}")

        if response.status in [200, 202]:
            return {
                'statusCode': 200,
                'body': json.dumps('Investigation triggered successfully')
            }
        else:
            raise Exception(f"Webhook failed with status {response.status}")
    except Exception as e:
        print(f"Error: {str(e)}")
        raise e
2. EventBridge Rule : In the EventBridge Console, a rule will be created to monitor the ECS tasks of a particular cluster.
- In the EventBridge Console, select Rules in the left navigation pane under "Buses", or select EventBridge Rule with event pattern on the "Get Started" page.
- For Step 1, provide the Rule Name and a description. For Event bus, select the default Event Bus and select Next.
- For Step 2, under Events, select AWS events or EventBridge partner events as the event source. For more information about ECS events, see [4].
- Under Event Pattern, select Custom pattern and specify the following pattern:
{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": ["arn:aws:ecs:us-east-1:123456789012:cluster/demo-cluster"],
    "lastStatus": ["STOPPED"],
    "stoppedReason": [{ "exists": true }]
  }
}
Note: Update the clusterArn value in the detail section of the pattern with the ARN of the ECS cluster to be monitored.
- Once updated, click Next.
- For Step 3, under Target 1, select AWS service as the Target type and select Lambda Function in the drop-down menu.
- Specify the Lambda function that was created in the Lambda Function step of this scenario and select Next.
- If required, provide the relevant Tags in Step 4.
- In Step 5, review the Rule details and select Create Rule.
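The same rule can be created programmatically. The following sketch builds the event pattern from this section and registers the rule and its Lambda target through the boto3 EventBridge client (put_rule and put_targets). The helper names, the rule name in the usage comment, and the target ID are illustrative assumptions.

```python
import json


def build_ecs_stopped_pattern(cluster_arn):
    """Event pattern from this article: STOPPED tasks in one cluster."""
    return {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {
            "clusterArn": [cluster_arn],
            "lastStatus": ["STOPPED"],
            "stoppedReason": [{"exists": True}],
        },
    }


def create_rule(events_client, rule_name, cluster_arn, function_arn):
    """Create the rule on the default event bus and attach the Lambda target."""
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_ecs_stopped_pattern(cluster_arn)),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "devops-agent-lambda", "Arn": function_arn}],  # hypothetical target ID
    )


# Usage (requires boto3 and AWS credentials):
#
#   import boto3
#   create_rule(
#       boto3.client("events"),
#       "ecs-task-stopped",  # hypothetical rule name
#       "arn:aws:ecs:us-east-1:123456789012:cluster/demo-cluster",
#       "arn:aws:lambda:us-east-1:123456789012:function:your-function-name",
#   )
```

The console adds the invoke permission for EventBridge automatically. When scripting, the function also needs a resource-based policy statement that allows the events.amazonaws.com principal to call lambda:InvokeFunction for this rule.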
Output :
Whenever an ECS task is marked as "STOPPED", a DevOps Agent investigation is triggered in the Agent Space.
The DevOps Agent will then review the ECS task details and provide the necessary information related to the task state and the reason why it is in the STOPPED state.
Troubleshooting
1. CloudWatch Logs: If the Lambda function fails, the error is logged to Amazon CloudWatch Logs. Check the log group /aws/lambda/your-function-name for error details including webhook response codes and payload issues.
2. Configuring a Dead Letter Queue (DLQ): For production workloads, configure a DLQ to capture failed Lambda invocations so no events are silently lost.
- Create an Amazon SQS queue to serve as the DLQ:
  - Open the Amazon SQS console and choose Create queue.
  - For Type, select Standard.
  - Enter a queue name (for example, devops-agent-webhook-dlq) and choose Create queue.
  - Copy the queue ARN.
- Add the following permission to the Lambda execution role:
{
  "Effect": "Allow",
  "Action": "sqs:SendMessage",
  "Resource": "arn:aws:sqs:us-east-1:123456789012:devops-agent-webhook-dlq"
}
Note: Replace 123456789012 with your AWS account ID and the queue name if different.
- Attach the DLQ to the Lambda function:
  - Open the Lambda console and select your function.
  - Choose Configuration, then Asynchronous invocation.
  - Choose Edit, and for Dead-letter queue service, select Amazon SQS.
  - Select the SQS queue you created and choose Save.
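As an alternative to the console, the DLQ can be attached with the boto3 Lambda client. The following is a minimal sketch using update_function_configuration with a DeadLetterConfig; the helper name attach_dlq is illustrative, and the ARN in the usage comment is the placeholder from this article.

```python
def attach_dlq(lambda_client, function_name, queue_arn):
    """Point the function's dead-letter queue at the given SQS queue ARN."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        DeadLetterConfig={"TargetArn": queue_arn},
    )


# Usage (requires boto3 and AWS credentials):
#
#   import boto3
#   attach_dlq(
#       boto3.client("lambda"),
#       "your-function-name",
#       "arn:aws:sqs:us-east-1:123456789012:devops-agent-webhook-dlq",
#   )
```

The call is expected to fail if the execution role lacks sqs:SendMessage on the queue, so the permission step above should be completed first.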
References
[1] DevOps Agent Webhook - https://docs.aws.amazon.com/devopsagent/latest/userguide/configuring-capabilities-for-aws-devops-agent-invoking-devops-agent-through-webhook.html
[2] Creating Amazon CloudWatch Alarms - https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
[3] Creating Amazon EventBridge rules - https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule.html
[4] Amazon ECS events and EventBridge - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch_event_stream.html
[5] AWS Lambda execution role - https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html