通过Bedrock Agent For CloudWatch Alarm快速上手AWS服务告警
对于使用AWS服务的新客户,如何快速上手CloudWatch Alarm告警配置?
AWS托管服务告警如何选择合适的指标?
Bedrock Agent For CloudWatch Alarm:
帮助客户以自然语言交互式学会Alarm配置,快速上手告警配置。
指导配置AWS服务告警,降低配置CloudWatch Alarm配置门槛。
可自动创建告警,包括选择指标,告警阈值,评估周期等,方便用户后续维护,对告警可查看、可编辑。
1.工具介绍
工具名 | Bedrock Agent For CloudWatch Alarm |
---|---|
使用场景 | 帮助客户降低配置cloudwatch alarm门槛,自然语言交互式学会alarm配置 |
功能1 | 提供告警配置建议,解释 |
功能2 | 创建cloudwatch alarm,可交互式修改 |
功能3 | 发送告警测试数据,触发告警 |
功能4 | 提供创建告警类似的aws cli命令 |
限制1 | 只支持单个指标和纬度创建,不支持批量 |
限制2 | 不支持验证指标的正确性,需要人工核对 |
使用的服务 | Bedrock Agent,Claude 3/3.5 haiku,Lambda,CloudWatch Alarm&Logs |
告警通知渠道,可选
- (推荐)使用serverless Notifier方案,快速集成Dingtalk / Feishu / Slack / Telegram。选择一个渠道进行部署后,本方案无需任何改动,告警触发后会自动通知到渠道。serverless Notifier部署文档
- 手动部署 SNS+Lambda+Chime或其他渠道,部署方式放在英文版本
2.部分使用示例
3.方案架构图
-
整个方案无服务器部署,降低成本同时降低运维工作量
-
基于Bedrock Agent提供ReAct能力(推理+行动)+ 事件驱动,弹性扩缩
-
告警通知渠道可选,1.serverless Notifier,单独部署 2.SNS+Lambda+Chime或其他渠道,部署方式放在英文版本
4.方案部署
4.2创建Bedrock Agent
本方案选择了Claude 3 Haiku模型,请提前在Bedrock模型访问权限中开通对应模型。不同的大模型具有不同的知识和推理能力,可能会影响对cloudwatch指标的理解和准确度,Claude 3/3.5 Haiku具有较高的性价比和精准度,大模型知识和能力越强,方案结果越精准。 创建Agent:
-
座席资源角色:创建和使用新的服务角色
-
选择模型:Anthropic - Claude 3 Haiku
-
其他设置:用户输入 - 选择‘已启用’
-
座席说明Agent Instructions:
Your role is to create CloudWatch alarms for AWS services. Please provide detailed information about the service and metric. Generate the AWS CLI command if needed. Directly execute the necessary functions to create the alarm, don't need to return control back to the agent. <thinking> Determine the user's needs;Gather user requirements;Collect and Assemble alarm parameters;Create alarm;(Optional) Send metric test data to trigger the alarm </thinking> process step-by-step: <step0> Determine the user's needs: 1. If user want AWS CLI command,skip to analyzing their needs and generating the command directly 2. If user want to send test data,find the newly created alert and skip to step5 3.If user asks for suggested metrics to set up alerts for AWS services, or inquires about the available metrics for a specific service, i will skip step1-5 and provide a list of commonly monitored metrics along with their explanations ,response to user. </step0> <step1> Understand what the user wants to monitor in AWS <example> User: I want to set up alerts for my EC2 instances Agent: Let's find the right metrics. Would you like to monitor CPU usage, network traffic, or something else? </example> </step1> <step2> Specify the exact dimension (e.g., instance ID for EC2) and the desired threshold (e.g., CPU utilization > 80%). Do not provide "AlarmActions" and "OKActions" parameters; I'll set default values. <example> User: I want to be alerted when my EC2 instance's CPU gets too high Agent: Which specific EC2 instance are we monitoring? And at what percentage CPU usage should we trigger the alert? </example> </step2> <step3> Construct alarm parameters based on user-specified metrics and CloudWatch Alarm requirements <param> Prepare the alarm parameters in JSON format, referring to the provided example. Pass the formatted JSON string to the put_alarm_data parameter. <example> put_alarm_data = { 'AlarmName': 'CPUUtilizationAlarm_$INSTANCE_ID', 'MetricName': 'CPUUtilization', 'Threshold': 70, ComparisonOperator='GreaterThanThreshold', 'Dimensions':[{'Name': 'InstanceId','Value': '$INSTANCE_ID'}], 'AlarmActions':[''], 'OKActions':[''], ... } </example> AlarmName: should be a combination of the metric name and the dimension ID (e.g., instance ID). The dimension ID, represented by $INSTANCE_ID, is a variable that must be provided by the user. If the metric does not require a dimension ID, generate a 6-digit random string and append it to the metric name. Dimensions: typically EC2 instance IDs, RDS instance IDs, etc. If the metric has a Dimensions field, the dimension value must be provided and cannot be fabricated. AlarmActions: Do not provide a value for the OKActions parameter. It should be left empty. OKActions: Do not provide a value for the OKActions parameter. It should be left empty. Other parameters: Refer to the CloudWatch put metric alarm API for additional parameters. </param> Do not use fictitious values like $variable, replace all variables with actual values </step3> <step4> Use the put-metric-alarm function to create CloudWatch alarm based on assembled parameters. If creation fails, guide the user to use the AWS CLI for troubleshooting. </step4> <step5> (Optional) Send metric test data to CloudWatch based on the newly created alert.This data should be a metric and its value, designed to exceed the alarm's threshold. For instance, if the alarm triggers at 70% CPU utilization, send a value like 80%. Consider the alarm's EvaluationPeriods setting; multiple data points might be needed if it's greater than 1. <param> Prepare the alarm parameters in JSON format, referring to the provided example. Pass the formatted JSON string to the put_metric_data parameter. <example> put_metric_data = { Namespace='AWS/EC2', MetricData=[ { 'MetricName': 'CPUUtilization', 'Dimensions': [ { 'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0' }, ], 'Value': 90, 'Unit': 'Percent' }, ] } </example> MetricData: metric data sent to CloudWatch MetricName: align with this used when creating the alarm Dimensions: align with those used when creating the alarm Other parameters: refer to the CloudWatch put metric data API for additional parameters. </param> If it fails, provide the AWS CLI command and recommend using the CLI for testing. </step5>
点击保存。
- 操作组:添加
-
选择创建新的Lambda函数,之后再修改这个函数。
-
在操作组中添加两个函数:
-
函数1: put-metric-alarm 用于创建CloudWatch Alarm告警
- 函数2: put-metric-data 用于发送CloudWatch 指标数据
点击创建操作组函数。
- 自动创建函数后,查看操作组Lambda函数,对该函数进行代码修改
- 代码参考如下:
import boto3 import json def put_metric_alarm(put_alarm_data): print('put-metric-alarm') cloudwatch = boto3.client('cloudwatch') alarm_data = json.loads(put_alarm_data) try: response = cloudwatch.put_metric_alarm(**alarm_data) print(response) return { 'statusCode': 200, 'msg': 'alarm created success' } except Exception as e: return { 'statusCode': 500, 'msg': f'alarm created fail, fail msg: {str(e)}' } def put_metric_data(put_metric_data): print('put_metric_data') cloudwatch = boto3.client('cloudwatch') metric_data = json.loads(put_metric_data) try: response = cloudwatch.put_metric_data(**metric_data) print(response) return { 'statusCode': 200, 'msg': 'send metric data success' } except Exception as e: return { 'statusCode': 500, 'msg': f'send metric data fail,fail msg: {str(e)}' } def lambda_handler(event, context): agent = event['agent'] actionGroup = event['actionGroup'] function = event['function'] parameters = event.get('parameters', []) print('event:{}'.format(event)) params = { 'put_alarm_data': '', 'put_metric_data': '' } for param in parameters: if param['name'] == 'put_alarm_data': params['put_alarm_data'] = param['value'] elif param['name'] == 'put_metric_data': params['put_metric_data'] = param['value'] print('params:{}'.format(params)) response = {} if function == 'put-metric-alarm': response = put_metric_alarm(params['put_alarm_data']) elif function == 'put-metric-data': response = put_metric_data(params['put_metric_data']) responseBody = { "TEXT": { "body": json.dumps(response) } } print(responseBody) action_response = { 'actionGroup': actionGroup, 'function': function, 'functionResponse': { 'responseBody': responseBody } } dummy_function_response = {'response': action_response, 'messageVersion': event['messageVersion']} print("Response: {}".format(dummy_function_response)) return dummy_function_response
代码修改后点击保存并deploy部署。
-
修改函数配置,超时设置为30s。
-
修改函数执行角色的权限,赋予cloudwatch权限
4.3测试Agent
- 点击‘在座席构建器中编辑’
- 确认座席中的信息是否正确,包括模型选择,座席说明,其他设置,操作组。没问题后 先点击‘保存’再点击‘准备’
- 在页面右侧进行测试
- 可以参考以下指令进行测试,Agent响应后请进行交互:
建议对lambda哪些指标进行告警? 创建lambda的errors告警 你建议对rds创建哪些告警? 这些告警对应的指标分别是什么? 创建rds数据库连接数告警 testrds1,连接数大于1000 创建ec2告警 创建ec2 i-0123456789 系统状态检查失败告警 发送测试数据 我希望这个告警的评估周期改成连续2分钟 这个创建告警类似的aws cli命令是什么? 创建ec2 状态检查失败的告警,只需要提供aws cli,不需要创建
- 客户也可以选择 streamlit 这样的UI工具测试
- 创建的告警可以在CloudWatch Alarm页面查看,并根据Agent自动生成的指标测试数据进行告警渠道消息查看
- 之后可以对自动创建的Alarm进行查看、编辑,以满足实际生产需求
5.总结
Bedrock Agent提供大语言模型推理和action能力,能够根据客户需求描述自动完成业务操作。本次演示了Bedrock Agent for CloudWatch Alarm,构建了一个智能告警助手,整个方案无服务器部署,降低了成本同时降低运维工作量,通过对成本预估,1美元可以自动完成200-400个告警配置。通过该工具,客户可以快速上手Alarm告警配置,生成告警后可以在控制台查看、编辑告警,使告警满足实际生产需求。
相关内容
- AWS 官方已更新 6 个月前
- AWS 官方已更新 4 个月前
- AWS 官方已更新 7 个月前
- AWS 官方已更新 2 年前