Questions tagged with Amazon CloudWatch

1 answer · 0 votes · 15 views · asked 16 days ago

Manual remediation config works, automatic remediation config fails

SOLVED! There was a syntax problem in the runbook that is not detected when remediating manually. In the content of the remediation document (which was created using CloudFormation), I used this parameter declaration:

```
parameters:
  InstanceID:
    type: 'AWS::EC2::Instance::Id'
```

It should be:

```
parameters:
  InstanceID:
    type: String
```

=====================================================================================

I have a remediation runbook that creates CloudWatch alarms for the metric `CPUUtilization` for any EC2 instances that have none defined. The runbook is configured as the remediation document for a Config rule that checks for the absence of such alarms.

When I configure the remediation on the rule as manual, all goes well. When I configure the remediation with the exact same runbook as automatic, the remediation fails with this error (snippet):

```
"StepDetails": [
    {
        "Name": "Initialization",
        "State": "FAILED",
        "ErrorMessage": "Invalid Automation document content for Create-CloudWatch-Alarm-EC2-CPUUtilization",
        "StartTime": "2022-05-09T17:30:02.361000+02:00",
        "StopTime": "2022-05-09T17:30:02.361000+02:00"
    }
],
```

This is the remediation configuration for the automatic remediation. The only difference from the manual remediation configuration is, obviously, that the value of the key "Automatic" is "false" there.

```
{
    "RemediationConfigurations": [
        {
            "ConfigRuleName": "rul-ensure-cloudwatch-alarm-ec2-cpuutilization-exists",
            "TargetType": "SSM_DOCUMENT",
            "TargetId": "Create-CloudWatch-Alarm-EC2-CPUUtilization",
            "TargetVersion": "$DEFAULT",
            "Parameters": {
                "AutomationAssumeRole": { "StaticValue": { "Values": [ "arn:aws:iam::123456789012:role/rol_ssm_full_access_to_cloudwatch" ] } },
                "ComparisonOperator": { "StaticValue": { "Values": [ "GreaterThanThreshold" ] } },
                "InstanceID": { "ResourceValue": { "Value": "RESOURCE_ID" } },
                "Period": { "StaticValue": { "Values": [ "300" ] } },
                "Statistic": { "StaticValue": { "Values": [ "Average" ] } },
                "Threshold": { "StaticValue": { "Values": [ "10" ] } }
            },
            "Automatic": true,
            "MaximumAutomaticAttempts": 5,
            "RetryAttemptSeconds": 60,
            "Arn": "arn:aws:config:eu-west-2:123456789012:remediation-configuration/rul-ensure-cloudwatch-alarm-ec2-cpuutilization-exists/5e3a81a7-fc55-4cbe-ad75-6b27be8da79a"
        }
    ]
}
```

The error message is rather cryptic, and I can't find documentation on possible root causes. Any suggestions would be very welcome! Thanks!
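
For anyone arriving here from the same cryptic error, a minimal sketch of the corrected parameter declaration as it might look when the document is created from CDK/CloudFormation (TypeScript; the document name comes from the question, everything else is illustrative and the main steps are omitted):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ssm from 'aws-cdk-lib/aws-ssm';
import { Construct } from 'constructs';

export class RemediationDocStack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    new ssm.CfnDocument(this, 'CreateCpuAlarmDoc', {
      name: 'Create-CloudWatch-Alarm-EC2-CPUUtilization',
      documentType: 'Automation',
      content: {
        schemaVersion: '0.3',
        assumeRole: '{{ AutomationAssumeRole }}',
        parameters: {
          AutomationAssumeRole: { type: 'String' },
          // Plain SSM type -- not the CloudFormation-style
          // 'AWS::EC2::Instance::Id', which manual remediation tolerates
          // but automatic remediation rejects (per the finding above).
          InstanceID: { type: 'String' },
        },
        mainSteps: [
          // aws:executeAwsApi step calling cloudwatch PutMetricAlarm,
          // omitted here for brevity.
        ],
      },
    });
  }
}
```
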
1 answer · 0 votes · 10 views · asked 20 days ago

Possible to edit an alarm name and delete created metrics in CloudWatch?

Hi! I'm trying to do a bit of clean-up in CloudWatch, but I've run into some problems. I'm still a newbie at CloudWatch, so I'm hoping there are some easy answers to my questions. Having said this, I'm quite surprised that what I'm trying to achieve seems to be hard.

**1. Is it not possible to edit an alarm name?**

When I go into edit mode of an already created alarm, I can edit the metric name that the alarm is based on, but not the actual alarm name itself (step 3: Add name and description). The input box for the name is greyed out, i.e. read-only. Since all alarms appear in a flat list, and as far as I know there is no tree-view hierarchy in this list, you do want to group alarm names in order to get some kind of overview and hierarchy to find the alarm you are looking for. It seems a choice has been made under the CloudWatch hood to make the alarm name act as the programming identifier, i.e. not editable. Why?

**So my question boils down to: is it only possible to edit the alarm name by deleting the alarm and recreating it?**

**2. Some metrics seem impossible to delete**

When I was new to CloudWatch, I accidentally created several different namespaces for what should have been one namespace (mistyped). So I need to clean up and move some metrics from one namespace to another. Moving a metric between namespaces does not seem possible, so I recreated the metric under the correct namespace name. But now I find that it seems impossible to delete the old metric in the old, incorrect namespace.

If I use the menu option "Metrics -> All metrics", I get an overview of all namespaces under the Browse tab. But I can see no way in this view to delete either the metric or the faulty namespace it resides in. How do I achieve this?

If I go into the main Log groups view, there is a "Metric filters" column in the grid that links to the metrics attached to each log group. But the metrics with faulty namespaces do not appear anywhere, which may be because the old log group they were based on has since been deleted. If so, I guess they were detached from the parent log group and are no longer accessible? So why were they not deleted when the log group was deleted (cascading delete), and why does it not seem possible to delete a parentless metric if so?
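
Two notes that may help, offered as best-effort understanding rather than an authoritative answer: CloudWatch has no rename API for alarms, so the usual workaround is copy-then-delete; and metrics cannot be deleted at all -- they age out on their own once data stops being published (CloudWatch retains metric data for 15 months), and an empty namespace disappears with them. A sketch of copy-then-delete with the AWS SDK for JavaScript v2 (covers plain single-metric alarms only; math/composite alarms would need more fields copied):

```typescript
import * as AWS from 'aws-sdk';

const cw = new AWS.CloudWatch({ region: 'eu-west-1' }); // region is an assumption

// "Rename" an alarm by recreating it under a new name, then deleting the old one.
async function renameAlarm(oldName: string, newName: string): Promise<void> {
  const { MetricAlarms } = await cw.describeAlarms({ AlarmNames: [oldName] }).promise();
  const alarm = MetricAlarms?.[0];
  if (!alarm) throw new Error(`Alarm not found: ${oldName}`);

  // Recreate the alarm under the new name with the same definition.
  await cw.putMetricAlarm({
    AlarmName: newName,
    AlarmDescription: alarm.AlarmDescription,
    ActionsEnabled: alarm.ActionsEnabled,
    AlarmActions: alarm.AlarmActions,
    OKActions: alarm.OKActions,
    MetricName: alarm.MetricName!,
    Namespace: alarm.Namespace!,
    Statistic: alarm.Statistic,
    Dimensions: alarm.Dimensions,
    Period: alarm.Period!,
    EvaluationPeriods: alarm.EvaluationPeriods!,
    Threshold: alarm.Threshold!,
    ComparisonOperator: alarm.ComparisonOperator!,
  }).promise();

  await cw.deleteAlarms({ AlarmNames: [oldName] }).promise();
}
```
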
3 answers · 0 votes · 9 views · asked 23 days ago

AWS Lambda@Edge created using AWS CDK doesn't write logs to CloudWatch

I created a simple Lambda@Edge function like below.

```
'use strict';

exports.handler = async function(event, context, callback) {
    const cf = event.Records[0].cf;
    console.log('Record: ', JSON.stringify(cf, null, 2));
    console.log('Context: ', JSON.stringify(context, null, 2));
    console.log('Request: ', JSON.stringify(cf.request, null, 2));
    callback(null, cf.request);
}
```

And I deployed it using the AWS CDK v2 experimental `EdgeFunction` like below:

```
const edgeFunction = new cloudfront.experimental.EdgeFunction(this, 'EdgeFunction', {
  runtime: Runtime.NODEJS_14_X,
  handler: 'index.handler',
  code: Code.fromAsset(path.join(__dirname, '../../../../lambda/ssr2')),
});
```

I also set it up as an edge function for a Distribution:

```
const distribution = new Distribution(this, 'Distribution', {
  defaultBehavior: {
    origin,
    cachePolicy: CachePolicy.CACHING_DISABLED,
    viewerProtocolPolicy: ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
    edgeLambdas: [
      {
        functionVersion: edgeFunction.currentVersion,
        eventType: LambdaEdgeEventType.VIEWER_REQUEST,
      }
    ]
  },
```

But when I send a request to the Distribution, no logs show up. I checked the permissions; the role already has:

```
Allow: logs:CreateLogGroup
Allow: logs:CreateLogStream
Allow: logs:PutLogEvents
```

I expect the function to write logs to CloudWatch. What did I miss?

**UPDATE 1**

Below is the role document (icon base64 truncated):

```
{
  "sdkResponseMetadata": null,
  "sdkHttpMetadata": null,
  "partial": false,
  "permissionsBoundary": null,
  "policies": [
    {
      "arn": "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
      "document": {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "logs:CreateLogGroup",
              "logs:CreateLogStream",
              "logs:PutLogEvents"
            ],
            "Resource": "*"
          }
        ]
      },
      "id": "ANPAJNCQGXC425412345",
      "name": "AWSLambdaBasicExecutionRole",
      "type": "managed"
    }
  ],
  "resources": {
    "logs": {
      "service": {
        "icon": "data:image/svg+xml;base64,...",
        "name": "Amazon CloudWatch Logs"
      },
      "statements": [
        { "action": "logs:CreateLogGroup", "effect": "Allow", "resource": "*", "service": "logs", "source": { "index": "0", "policyName": "AWSLambdaBasicExecutionRole", "policyType": "managed" } },
        { "action": "logs:CreateLogStream", "effect": "Allow", "resource": "*", "service": "logs", "source": { "index": "0", "policyName": "AWSLambdaBasicExecutionRole", "policyType": "managed" } },
        { "action": "logs:PutLogEvents", "effect": "Allow", "resource": "*", "service": "logs", "source": { "index": "0", "policyName": "AWSLambdaBasicExecutionRole", "policyType": "managed" } }
      ]
    }
  },
  "roleName": "MyProject-EdgeFunctionFnServiceRoleC7B72E4-1DV3AZXP558ZS",
  "trustedEntities": [ "lambda.amazonaws.com", "edgelambda.amazonaws.com" ]
}
```

I just tried using Test in the Lambda panel. All the tests send logs to CloudWatch. However, when I send a request through CloudFront, nothing is logged.

**UPDATE 2**

I just found out from Stack Overflow that the logs are not stored centrally but are distributed across regions, in log groups like:

```
/aws/lambda/us-east-1.MyProject-EdgeFunctionFn44308ADF-loJeFwXXzTOm
```

So instead of opening it from the Lambda panel, I need to open it from the CloudFront panel. Somehow I couldn't find this in any AWS documentation.

**References**

https://aws.amazon.com/id/blogs/networking-and-content-delivery/aggregating-lambdaedge-logs/
https://stackoverflow.com/questions/66949758/serverless-aws-lambdaedge-how-to-debug#:~:text=Go%20to%20CloudWatch%20and%20search,%2D%3E%20Lambda%40Edge%20Errors%20.
2 answers · 0 votes · 28 views · asked a month ago

ec2tagger: Unable to describe ec2 tags for initial retrieval: AuthFailure: AWS was not able to validate the provided access credentials / CloudWatch agent, VPC endpoints

I got the error "ec2tagger: Unable to describe ec2 tags for initial retrieval: AuthFailure: AWS was not able to validate the provided access credentials" from the CloudWatch agent on an EC2 instance that has:

1. CloudWatchAgentServerRole -- the default AWS managed role attached to the instance; this default role already allows `ec2:DescribeTags` in its policy. <---- NOTE this
2. A NACL that allows all outbound traffic and allows the whole VPC CIDR range inbound.
3. The correct region in the CloudWatch agent config file.
4. `telnet ec2.us-east-2.amazonaws.com 443`, `telnet monitoring.us-east-2.amazonaws.com 443`, and `telnet logs.us-east-2.amazonaws.com 443` from the EC2 instance all return a successful connection (Connected <..> Escape character is '^]').

I also created three VPC interface endpoints: logs (com.amazonaws.us-east-2.logs), monitoring (com.amazonaws.us-east-2.monitoring), and ec2 (com.amazonaws.us-east-2.ec2). They have a security group that allows the whole VPC CIDR range inbound. The idea is to expose metrics to CloudWatch via the VPC endpoints.

Despite all of the above, I can't make the CloudWatch agent work: it keeps echoing the error above, complaining that the credentials are not valid, even though the region in the config file is correct and traffic between the instance and CloudWatch is allowed.
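
One way to narrow this down (a diagnostic sketch, not a diagnosis -- clock skew and routing are merely common causes of AuthFailure): run the same DescribeTags call the ec2tagger makes, from the instance, with the instance-profile credentials:

```typescript
import * as AWS from 'aws-sdk';

// Diagnostic sketch: call DescribeTags from the instance itself, using the
// same instance-profile credentials the agent uses. With Private DNS enabled
// on the com.amazonaws.us-east-2.ec2 endpoint, the default hostname resolves
// to the endpoint ENI. If this also fails with AuthFailure, the problem is
// the credential/endpoint path (e.g. system clock skew, or the call leaving
// via a different route), not the agent configuration.
const ec2 = new AWS.EC2({ region: 'us-east-2' });

async function checkDescribeTags(instanceId: string): Promise<void> {
  const res = await ec2.describeTags({
    Filters: [{ Name: 'resource-id', Values: [instanceId] }],
  }).promise();
  console.log(JSON.stringify(res.Tags, null, 2));
}

checkDescribeTags('i-0123456789abcdef0').catch(console.error); // placeholder instance ID
```
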
1 answer · 0 votes · 55 views · asked a month ago

Greengrass logs are pushed to CloudWatch only intermittently

Hello,

We are using Greengrass v2 and would like to have logs pushed as frequently as possible. We understand the limitations for certain components (e.g. telemetry), but our application logs should be sent in as near real time as we can get. Currently, regardless of the configuration set, our logs are only sent to CloudWatch intermittently: they show up every few hours or even every few days. Could anyone please help us understand what's happening? We are using the following configuration (see the sketch after it):

```
'aws.greengrass.Nucleus': {
  componentVersion: '2.4.0',
  configurationUpdate: {
    merge: `{
      "logging": {
        "level": "DEBUG",
        "format": "JSON"
      }
    }`,
  },
},
'aws.greengrass.LogManager': {
  componentVersion: '2.2.1',
  configurationUpdate: {
    merge: `{
      "logsUploaderConfiguration": {
        "systemLogsConfiguration": {
          "uploadToCloudWatch": "true",
          "minimumLogLevel": "DEBUG",
          "diskSpaceLimit": "2",
          "diskSpaceLimitUnit": "KB",
          "deleteLogFileAfterCloudUpload": "true"
        },
        "componentLogsConfigurationMap": {
          "com.component1": {
            "minimumLogLevel": "DEBUG",
            "diskSpaceLimit": "2",
            "logFileDirectoryPath": "/greengrass/v2/logs/",
            "logFileRegex": "com.component1\\\\w*.log",
            "diskSpaceLimitUnit": "KB",
            "deleteLogFileAfterCloudUpload": "true"
          },
          "com.component2": {
            "minimumLogLevel": "DEBUG",
            "diskSpaceLimit": "2",
            "logFileDirectoryPath": "/greengrass/v2/logs/",
            "logFileRegex": "com.component2\\\\w*.log",
            "diskSpaceLimitUnit": "KB",
            "deleteLogFileAfterCloudUpload": "true"
          },
          "aws.greengrass.SageMakerEdgeManager": {
            "minimumLogLevel": "DEBUG",
            "logFileDirectoryPath": "/greengrass/v2/logs/",
            "logFileRegex": "aws.greengrass.SageMakerEdgeManager\\\\w*.log",
            "diskSpaceLimit": "2",
            "diskSpaceLimitUnit": "KB",
            "deleteLogFileAfterCloudUpload": "true"
          },
          "aws.greengrass.SecureTunneling": {
            "minimumLogLevel": "DEBUG",
            "logFileDirectoryPath": "/greengrass/v2/logs/",
            "logFileRegex": "aws.greengrass.SecureTunneling\\\\w*.log",
            "diskSpaceLimit": "2",
            "diskSpaceLimitUnit": "KB",
            "deleteLogFileAfterCloudUpload": "true"
          }
        }
      },
      "periodicUploadIntervalSec": "10"
    }`,
  },
},
```
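
If I read the Greengrass docs right (worth verifying against the LogManager changelog), LogManager versions before 2.3.x only upload log files once they have been rotated, so with `deleteLogFileAfterCloudUpload` and a tiny `diskSpaceLimit` of 2 KB the uploader may be waiting on rotation (or discarding files) rather than honoring `periodicUploadIntervalSec`. A sketch of the same merge block with a newer component version and roomier limits (the version number and limits are assumptions to check, and only one component is shown):

```typescript
'aws.greengrass.LogManager': {
  componentVersion: '2.3.1', // assumption: a 2.3.x+ release that can upload the active log file
  configurationUpdate: {
    merge: `{
      "logsUploaderConfiguration": {
        "systemLogsConfiguration": {
          "uploadToCloudWatch": "true",
          "minimumLogLevel": "DEBUG",
          "diskSpaceLimit": "64",
          "diskSpaceLimitUnit": "MB",
          "deleteLogFileAfterCloudUpload": "true"
        },
        "componentLogsConfigurationMap": {
          "com.component1": {
            "minimumLogLevel": "DEBUG",
            "logFileDirectoryPath": "/greengrass/v2/logs/",
            "logFileRegex": "com.component1\\\\w*.log",
            "diskSpaceLimit": "64",
            "diskSpaceLimitUnit": "MB",
            "deleteLogFileAfterCloudUpload": "true"
          }
        }
      },
      "periodicUploadIntervalSec": "10"
    }`,
  },
},
```
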
3 answers · 0 votes · 9 views · asked a month ago

AWS SDK: get the number of messages in an SQS dead-letter queue

Hello community,

I somehow can't find the right information. I have the following simple task to solve: create a Lambda that checks whether a dead-letter queue has messages and, if it does, how many.

Before this, I had an alarm set on an SQS metric. I chose the `ApproximateNumberOfMessagesVisible` metric, since `NumberOfMessagesSent` (which was my first choice) does not work for dead-letter queues. I have read this article: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html.

> The NumberOfMessagesSent and NumberOfMessagesReceived for a dead-letter queue don't match
>
> If you send a message to a dead-letter queue manually, it is captured by the NumberOfMessagesSent metric. However, if a message is sent to a dead-letter queue as a result of a failed processing attempt, it isn't captured by this metric. Thus, it is possible for the values of **NumberOfMessagesSent** and NumberOfMessagesReceived to be different.

That is nice to know, but it doesn't say which metric to use if **NumberOfMessagesSent** won't work. Being pragmatic, I triggered an error so that a message was sent to the DLQ as a result of a failed processing attempt. Then I looked at the queue in the AWS console under the Monitoring tab and checked which metric spiked. It was **ApproximateNumberOfMessagesVisible**, which sounded suitable, so I used it.

Now I wanted to get alerted more often, so I chose to build a Lambda function that checks how many messages are in the DLQ. I use JavaScript/TypeScript, so I found this: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_GetQueueAttributes.html. The code looked something like this:

```
const params = {
  QueueUrl: url,
  AttributeNames: ['ApproximateNumberOfMessagesVisible']
}
const resp = SQS.getQueueAttributes(params).promise()
```

It was kind of a bummer that the attribute I wanted was not in there, or rather: it is not valid.

> Valid Values: All | Policy | VisibilityTimeout | MaximumMessageSize | MessageRetentionPeriod | ApproximateNumberOfMessages | ApproximateNumberOfMessagesNotVisible | CreatedTimestamp | LastModifiedTimestamp | QueueArn | ApproximateNumberOfMessagesDelayed | DelaySeconds | ReceiveMessageWaitTimeSeconds | RedrivePolicy | FifoQueue | ContentBasedDeduplication | ...

My next attempt was to use CloudWatch metrics, so I tried this: https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/cloudwatch-examples-getting-metrics.html

```
var params = {
  Dimensions: [
    {
      Name: 'LogGroupName', /* required */
    },
  ],
  MetricName: 'IncomingLogEvents',
  Namespace: 'AWS/Logs'
};
cw.listMetrics(params, function(err, data) {
  if (err) {
    console.log("Error", err);
  } else {
    console.log("Metrics", JSON.stringify(data.Metrics));
  }
});
```

but I could not get this working, since I did not know what to put into Dimensions/Name. Please note that I have not been working with AWS for very long (only 6 months); maybe I am on a totally wrong track.

Summarized: I want my Lambda to get the number of messages in a DLQ. I hope someone can help me.

Cheers
Aleks
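
For what it's worth, the GetQueueAttributes counterpart of the CloudWatch metric ApproximateNumberOfMessagesVisible is the `ApproximateNumberOfMessages` attribute, which does appear in the valid-values list quoted above. A minimal sketch in the same SDK v2 style as the snippets in the question (region and queue URL are placeholders):

```typescript
import * as AWS from 'aws-sdk';

const sqs = new AWS.SQS({ region: 'eu-central-1' }); // region is an assumption

// Returns the approximate number of messages visible in the queue --
// for a DLQ, that is the count of dead-lettered messages waiting.
async function dlqMessageCount(queueUrl: string): Promise<number> {
  const resp = await sqs.getQueueAttributes({
    QueueUrl: queueUrl,
    AttributeNames: ['ApproximateNumberOfMessages'],
  }).promise();
  return Number(resp.Attributes?.ApproximateNumberOfMessages ?? 0);
}

dlqMessageCount('https://sqs.eu-central-1.amazonaws.com/123456789012/my-dlq') // placeholder URL
  .then((n) => console.log(`Messages in DLQ: ${n}`))
  .catch(console.error);
```
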
1 answer · 0 votes · 8 views · asked 2 months ago

CloudFormation / Auto Scaling: how to get the sum (not the average) of the metrics from all nodes?

I set my threshold to scale up when CPU usage is 80% and to scale in when usage is below 70%. The problem is that (AFAIK) for an Auto Scaling group the average value is taken. Why is that a problem? Example situation:

1. There is one node, and I put it under 100% CPU load.
2. The alarm is triggered, and another instance is created.
3. Now the metric is divided by 2, so `(100% + 0%) / 2 = 50%`, which is below 70% -> the scale-in alarm is triggered, and even though one node is still loaded at 100%, a node is destroyed.

Ideally, for scale-in I would use not the average but the SUM of the loads on all nodes. There is the `AWS::CloudWatch::Alarm/Properties/Statistic` setting with average or sum values, but doesn't that apply across evaluation periods, not across the number of instances in a given dimension?
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-cw-alarm.html#cfn-cloudwatch-alarms-statistic

My template (see the sketch after it):

```
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Creates Autoscaling group. Used securitygroup ids and subnets ids are hardcoded.",
  "Parameters": {
    "myprojectAmiId": {
      "Description": "New AMI ID which will be used to create/update autoscaling group",
      "Type": "AWS::EC2::Image::Id"
    },
    "myprojectNodesDefaultQuantity": {
      "Type": "Number",
      "MinValue": "1"
    }
  },
  "Resources": {
    "myprojectLaunchTemplate": {
      "Type": "AWS::EC2::LaunchTemplate",
      "Properties": {
        "LaunchTemplateData": {
          "IamInstanceProfile": { "Arn": "arn:aws:iam::censored6:instance-profile/myproject-ec2" },
          "ImageId": { "Ref": "myprojectAmiId" },
          "InstanceType": "t3a.small",
          "KeyName": "my-ssh-key",
          "SecurityGroupIds": [ "sg-censored", "sg-censored", "sg-censored5", "sg-censored" ]
        }
      }
    },
    "myprojectAutoScalingGroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
          "MaxBatchSize": "1",
          "MinInstancesInService": "1",
          "PauseTime": "PT5M",
          "WaitOnResourceSignals": "true"
        }
      },
      "Properties": {
        "MinSize": { "Ref": "myprojectNodesDefaultQuantity" },
        "MaxSize": "3",
        "HealthCheckGracePeriod": 300,
        "LaunchTemplate": {
          "LaunchTemplateId": { "Ref": "myprojectLaunchTemplate" },
          "Version": { "Fn::GetAtt": [ "myprojectLaunchTemplate", "LatestVersionNumber" ] }
        },
        "VPCZoneIdentifier": [ "subnet-censored", "subnet-0censoredc" ],
        "TargetGroupARNs": [ "arn:aws:elasticloadbalancing:us-west-2:censored:targetgroup/autoscaling-tests-targetgroup/censored" ],
        "Tags": [
          { "Key": "Name", "Value": "myproject-cloudformation-ascaling-tests", "PropagateAtLaunch": true },
          { "Key": "Stack", "Value": "dev-staging", "PropagateAtLaunch": true },
          { "Key": "CreatedBy", "Value": "cloudformation", "PropagateAtLaunch": true }
        ]
      }
    },
    "myprojectScaleUpPolicy": {
      "Type": "AWS::AutoScaling::ScalingPolicy",
      "Properties": {
        "AdjustmentType": "ChangeInCapacity",
        "AutoScalingGroupName": { "Ref": "myprojectAutoScalingGroup" },
        "Cooldown": "60",
        "ScalingAdjustment": 1
      }
    },
    "myprojectScaleDownPolicy": {
      "Type": "AWS::AutoScaling::ScalingPolicy",
      "Properties": {
        "AdjustmentType": "ChangeInCapacity",
        "AutoScalingGroupName": { "Ref": "myprojectAutoScalingGroup" },
        "Cooldown": "60",
        "ScalingAdjustment": -1
      }
    },
    "myprojectCPUAlarmHigh": {
      "Type": "AWS::CloudWatch::Alarm",
      "Properties": {
        "AlarmActions": [ { "Ref": "myprojectScaleUpPolicy" } ],
        "AlarmDescription": "Scale-up if CPU > 80% for 5 minutes",
        "ComparisonOperator": "GreaterThanThreshold",
        "Dimensions": [ { "Name": "AutoScalingGroupName", "Value": { "Ref": "myprojectAutoScalingGroup" } } ],
        "EvaluationPeriods": 2,
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/EC2",
        "Period": 30,
        "Statistic": "Average",
        "Threshold": 80
      }
    },
    "myprojectCPUAlarmLow": {
      "Type": "AWS::CloudWatch::Alarm",
      "Properties": {
        "AlarmActions": [ { "Ref": "myprojectScaleDownPolicy" } ],
        "AlarmDescription": "Scale-down if CPU < 70% for 10 minutes",
        "ComparisonOperator": "LessThanThreshold",
        "Dimensions": [ { "Name": "AutoScalingGroupName", "Value": { "Ref": "myprojectAutoScalingGroup" } } ],
        "EvaluationPeriods": 2,
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/EC2",
        "Period": 600,
        "Statistic": "Average",
        "Threshold": 70
      }
    }
  }
}
```
0 answers · 0 votes · 5 views · asked 2 months ago

EFS performance/cost optimization

We have a relatively small EFS of about 20 GB in burst mode. It was set up about two months ago and there were not many performance issues; utilization was always under 2%, even under our max load (which only lasts a very short period of time).

Yesterday, we suddenly noticed that our site was not responding, yet our servers had very minimal CPU load. Then we saw that the utilization of the EFS had suddenly gone up to 100%. Digging deeper, it appears we had been slowly and consistently consuming the original 2.3T BurstCreditBalance over the past few weeks, and it hit zero yesterday.

Problems:

1. The EFS Monitoring tab provided completely useless information and does NOT even report BurstCreditBalance; we had to find it in CloudWatch ourselves.
2. The throughput-utilization view is misleading: we were actually slowly using up the credits, but there is no indication of that.
3. We have since switched to Provisioned mode at 10 MB/s in the meantime, as we're not really sure how to derive the correct throughput number our system needs. CloudWatch is showing 1s-average max values of MeteredIOBytes 7.3k, DataReadIOBytes 770k, DataWriteIOBytes 780k.
4. We're seeing BurstCreditBalance build up much more quickly (with 10 MB/s Provisioned) than we were consuming it previously (in Burst mode). However, when we switched to 2 MB/s Provisioned, our system was visibly throttled even though there was 1T of BurstCreditBalance. Why?

Main questions:

1. How do we define a Provisioned rate, based on the CloudWatch metrics, that is not excessive but does not limit our system when it needs throughput?
2. Ideally, we'd like to use Burst mode, as that fits our usage better, but with just 20 GB we don't seem to accumulate any BurstCreditBalance.
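
A sketch of one way to size the provisioned rate from these metrics (this follows the metric-math approach in the EFS monitoring docs as I understand it: the Sum of MeteredIOBytes over a period, divided by the period, gives average metered throughput in bytes/second, and the peak of that series is roughly the rate to provision). The region, the 24-hour window, and the file system ID are placeholders:

```typescript
import * as AWS from 'aws-sdk';

const cw = new AWS.CloudWatch({ region: 'us-east-1' }); // region is an assumption

// Average metered throughput (bytes/s) per 1-minute window over the last day.
async function peakThroughput(fileSystemId: string): Promise<void> {
  const res = await cw.getMetricData({
    StartTime: new Date(Date.now() - 24 * 3600 * 1000),
    EndTime: new Date(),
    MetricDataQueries: [
      {
        Id: 'metered',
        MetricStat: {
          Metric: {
            Namespace: 'AWS/EFS',
            MetricName: 'MeteredIOBytes',
            Dimensions: [{ Name: 'FileSystemId', Value: fileSystemId }],
          },
          Period: 60,
          Stat: 'Sum',
        },
        ReturnData: false,
      },
      // Metric math: bytes summed per window, divided by the window length.
      { Id: 'throughput', Expression: 'metered / PERIOD(metered)', ReturnData: true },
    ],
  }).promise();
  const values = res.MetricDataResults?.[0]?.Values ?? [];
  console.log(`Peak metered throughput: ${Math.max(...values, 0) / 1048576} MiB/s`);
}

peakThroughput('fs-0123456789abcdef0').catch(console.error); // placeholder ID
```
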
1 answer · 0 votes · 9 views · asked 2 months ago

How to invoke a private REST API (created with API Gateway) endpoint from an EventBusRule?

I have set up the following workflow:

- a private REST API with resources `/POST/event` and `/POST/process`
- a `VPCLink` to an `NLB` (which points to an `ALB` pointing to a microservice running on `EKS`)
- a `VPC endpoint` with DNS name `vpce-<id>-<id>.execute-api.eu-central-1.vpce.amazonaws.com` with `Private DNS enabled`
- an EventBridge `EventBus` with a rule that has two targets: one `API Destination` for debugging/testing and one `AWS Service` target which points to my private REST API on the resource `/POST/process`
- all required `Resource Policies` and `Roles`
- all resources are defined within the same AWS account

The **designed** workflow is as follows:

- invoke `POST/event` on the VPC endpoint (any other invocation is prohibited by the `Resource Policy`) with an `event` payload
- the API puts the `event` payload onto the `EventBus`
- the `EventBusRule` is triggered and sends the `event` payload to the `POST/process` endpoint on the private REST API
- the `POST/process` endpoint proxies the payload to a microservice running on EKS (via `VPCLink` > `NLB` > `ALB` > `k8s Service`)

**What does work** so far:

- invoking `POST/event` on the VPC endpoint
- putting the `event` payload onto the `EventBus`
- forwarding the `event` payload to the `API Destination` set up for testing/debugging (a temporary endpoint on https://webhook.site)
- testing the `POST/event` and `POST/process` integrations in the AWS console (the latter verified by checking that the `event` payload reaches the microservice on EKS successfully)

That is, all individual steps in the workflow seem to work, and all permissions seem to be set properly.

**What does not work** is invoking the `POST/process` endpoint from the `EventBusRule`; i.e. invoking `POST/event` does not invoke `POST/process` via the `EventBus`, _although_ the `EventBusRule` was triggered.

So my **question** is: **how do I invoke a private REST API endpoint from an EventBusRule?**

**What I have already tried:**

- changing the order of the `EventBusRule` targets
- creating a Route 53 record pointing to the `VPC endpoint` and treating it as an (external) `API Destination`
- allowing access from _anywhere_ by _anyone_ to the REST API (temporarily only, of course)

**Remark on the design:** I created _two_ endpoints (one for receiving an `event`, one for processing it) with an EventBus in between because:

- I have to expect a delay of several minutes between the `Event Creation/Notification` and the successful `Event Processing`
- I expect several hundred `event sources`, which are different AWS and Azure accounts
- I want to keep track of all events that _reach_ our API and of their successful _processing_ in one central EventBus, and _not_ inside each AWS account the events stem from
- I want to keep track of each _failed_ event processing in the same central EventBus, with only one central DeadLetterQueue
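
One constraint that may explain this (hedged, worth confirming against the EventBridge docs): rule targets and API destinations call endpoints over the public internet, so a REST API that is only reachable inside the VPC cannot be invoked directly from the rule. A common workaround is to make the rule target a Lambda function attached to the VPC and let it forward the payload to the private endpoint. A rough sketch (Node 18 runtime with global fetch; the API ID, stage, and path are placeholders):

```typescript
// Lambda (deployed into the VPC) that relays an EventBridge event to the
// private REST API. With Private DNS enabled on the execute-api VPC
// endpoint, the default execute-api hostname resolves to the endpoint ENIs.
const API_BASE = 'https://abc123restid.execute-api.eu-central-1.amazonaws.com/prod'; // placeholder

export const handler = async (event: { detail: unknown }): Promise<void> => {
  const res = await fetch(`${API_BASE}/process`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(event.detail),
  });
  if (!res.ok) {
    // Throwing lets EventBridge retry and eventually dead-letter the event.
    throw new Error(`POST /process failed: ${res.status} ${await res.text()}`);
  }
};
```
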
1 answer · 0 votes · 12 views · asked 2 months ago

Lambda logging to CloudWatch seems to be broken?

For some reason, logs from one of my Lambdas are no longer appearing in CloudWatch. I see output from a Test run on the Lambda's screen, but the CloudWatch list of logs is persistently empty.

After some flailing around, I decided to try creating a new Lambda with the same code and configuration. The Test ran. But when I hit "Click here to view the corresponding CloudWatch log group", it opened CloudWatch looking at the expected log group name -- with a big red warning that this group did not exist. Clicking "(Logs)" at the top of the test output gave the same behavior. This is surprising; I thought I remembered that creating a Lambda created its log group automagically...?

I tried creating the group manually, but now I'm back where I was -- the Lambda runs, I get local log output, but the CloudWatch log group for my Lambda still shows no log streams. I checked the CloudWatch configuration, and it does list both the old and new Lambdas' ARNs as being allowed to create and write to log streams...

My oldest Lambda (the one my Alexa skill uses directly) is still apparently writing to CloudWatch successfully. I am very confused. I'm a relatively new user, and I'm willing to believe this is user error -- but I have no idea what that error might be.

Any advice folks can offer on fixing this would be tremendously appreciated, especially since my skill just went live and the failing Lambda is the one triggered by an EventBridge cron job to update the database in the background. It does seem to be running OK for now -- but I need logs if I'm ever going to have to debug it again, and I need to understand why a new copy of that Lambda is having the same problem.

Programmer's mantra: "If it was easy, they wouldn't need _us_..."
2 answers · 0 votes · 23 views · asked 2 months ago

Training Metric logging on SageMaker experiment tracking: how to get time-series metrics with visualisation

I am using the SageMaker Python SDK to train a bespoke model. I have defined my `metric_definition` regexes and passed them to the estimator like:

```python
num_re = "([0-9\\.]+)(e-?[[01][0-9])?"
metrics = [
    {"Name": "learning-rate", "Regex": f"lr: {num_re}"},
    {"Name": "training:loss", "Regex": f"loss: {num_re}"},
    # ...
]

estimator = Estimator(
    image_uri=training_image_uri,
    # ...
    metric_definitions=metrics,
    enable_sagemaker_metrics=True,
)
```

When I run training, these metrics are visible in my logs and I can also see them in SageMaker Studio in `Trial Components > Metrics (tab)` as a grid of numbers like:

| Name | Minimum | Maximum | Standard Deviation | Average | Count | Final value |
|------|---------|---------|--------------------|---------|-------|-------------|
| learning-rate | 8.889 | 8.907 | 0.010392304845413657 | 8.898 | 4 | 8.907 |
| ... | | | | | | |

which suggests that the regexes are correctly matching on the logs.

However, I am not able to visualise any graphs for my metrics. I have tried all of:

- `SageMaker Studio > Trial components > charts`: it is only possible to plot things like `learning-rate_min` (i.e. a point value, not a time-series metric).
- `SageMaker AWS console > Training > Training jobs > <select job> > scroll to Monitor section`: here I can see metrics like CPUUtilization over time, but for my own metrics there is just an empty graph for each metric I have defined that says 'No data available'.
- `SageMaker AWS console > Training > Training jobs > <select job> > scroll to Monitor section > View algorithm metrics (opens in CloudWatch) > Browse > select metric (e.g. learning-rate) and 'Add to graph'`: I filter by the correct time period and go to the `Graphed metrics (1)` tab; even after updating the period to `1 second` I am not able to see anything on the graph.

I'm not sure what the issue is here, but any help would be much appreciated.
2 answers · 0 votes · 43 views · asked 2 months ago

Set CPU and memory requirements for a Fargate AWS Batch job from an Amazon CloudWatch event

I am trying to automate Fargate AWS Batch jobs by means of CloudWatch Events. So far, so good. I am trying to run the same job definition with different configurations. I am able to set the Batch job as a CloudWatch Event target, and I have learned how to use the Constant (JSON text) configuration to set a parameter of the job. Thus, I can set the name parameter successfully and the job runs.

However, I am not able to also set the memory and CPU settings in the CloudWatch Event. I would like to use a larger machine for a bigger port such as Singapore, without changing the job definition. At the moment, the job still uses the default vCPU and memory settings of the job definition.

```
{
  "Parameters": {"name": "wilhelmshaven"},
  "ContainerOverrides": {
    "Command": ["upload_to_day.py", "-port_name", "Ref::name"],
    "resourceRequirements": [
      {"type": "MEMORY", "value": "4096"},
      {"type": "VCPU", "value": "2"}
    ]
  }
}
```

Does anyone know how to set the Constant (JSON text) configuration or input transformer correctly?

Edit: if I try the same thing using the AWS CLI, I can achieve what I would like to do.

```
aws batch submit-job \
    --job-name "run-wilhelmshaven" \
    --job-queue "arn:aws:batch:eu-central-1:123666072061:job-queue/upload-raw-to-day-vtexplorer" \
    --job-definition "arn:aws:batch:eu-central-1:123666072061:job-definition/upload-to-day:2" \
    --container-overrides '{"command": ["upload_to_day.py", "-port_name", "wilhelmshaven"], "resourceRequirements": [{"value": "2", "type": "VCPU"}, {"value": "4096", "type": "MEMORY"}]}'
```
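
Since the CLI call above already does what you want, one workaround is to point the CloudWatch Events rule at a small Lambda and let it call SubmitJob with the overrides, mirroring the CLI exactly. A sketch with the AWS SDK for JavaScript v2 (the ARNs come from the question; the event shape and the port-to-size mapping are assumptions):

```typescript
import * as AWS from 'aws-sdk';

const batch = new AWS.Batch({ region: 'eu-central-1' });

// Assumption: map "big" ports to a larger container size.
const SIZES: Record<string, { vcpu: string; memory: string }> = {
  singapore: { vcpu: '2', memory: '4096' },
  wilhelmshaven: { vcpu: '1', memory: '2048' },
};

export const handler = async (event: { detail: { portName: string } }): Promise<void> => {
  const port = event.detail.portName; // assumption: the rule passes the port name in detail
  const size = SIZES[port] ?? { vcpu: '1', memory: '2048' };

  // Same call as the working CLI example, with per-port resource overrides.
  await batch.submitJob({
    jobName: `run-${port}`,
    jobQueue: 'arn:aws:batch:eu-central-1:123666072061:job-queue/upload-raw-to-day-vtexplorer',
    jobDefinition: 'arn:aws:batch:eu-central-1:123666072061:job-definition/upload-to-day:2',
    containerOverrides: {
      command: ['upload_to_day.py', '-port_name', port],
      resourceRequirements: [
        { type: 'VCPU', value: size.vcpu },
        { type: 'MEMORY', value: size.memory },
      ],
    },
  }).promise();
};
```
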
1 answer · 0 votes · 8 views · asked 2 months ago

Proper conversion of AWS Log Insights to Metrics for visualization and monitoring

TL;DR
----

What is the proper way to create a metric so that it generates reliable information about the log insights?

What is desired
------

The current Log Insights can be seen similar to the following

[![AWS Log insights][1]][1]

However, it becomes easier to analyse these logs using the metrics (mostly because you can have multiple sources of data in the same plot and even perform math operations between them).

Solution according to docs
-----

Allegedly, a log can be converted to a metric filter following a guide like [this][2]. However, this approach does not seem to work entirely right (I guess because of the time frames that have to be imposed in the metric plots), providing incorrect information, for example:

[![Dashboard][3]][3]

Issue with solution
-----

In the previous image I've created a dashboard containing the metric count (the number 7), corresponding to the sum of events each 5 minutes. Also I've added a preview of the log insight corresponding to the information used to create the event. However, as can be seen, the number of logs is 4, but the event count displays 7. Changing the time frame in the metric generates other types of issues (e.g., selecting a very small time frame like 1 sec won't retrieve any data, or a slightly smaller time frame will now provide another wrong number: 3, when there are 4 logs, for example).

P.S.
-----

I've also tried converting the log insights to metrics using [this lambda function][4] as suggested by [Danil Smirnov][5] to no avail, as it seems to generate the same issues.

[1]: https://i.stack.imgur.com/0pPdp.png
[2]: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CountingLogEventsExample.html
[3]: https://i.stack.imgur.com/Dy5td.png
[4]: https://serverlessrepo.aws.amazon.com/#!/applications/arn:aws:serverlessrepo:us-east-1:085576722239:applications~logs-insights-to-metric
[5]: https://blog.smirnov.la/cloudwatch-logs-insights-to-metrics-a2d197aac379
0 answers · 0 votes · 3 views · asked 3 months ago

Content Localization workflow fails at the end of translation

Hello everybody,

we've successfully deployed the Content Localization service with the CloudFormation template. When we upload a video in the Content Localization web app, the following error is raised:

```
{
  "error": "ValueError",
  "cause": {
    "errorMessage": "Exception: 'Stage Translate encountered and error during execution, aborting the workflow'",
    "errorType": "ValueError",
    "stackTrace": [
      " File \"/var/task/app.py\", line 315, in complete_stage_execution_lambda\n return complete_stage_execution(\"lambda\", event[\"Name\"], event[\"Status\"], event[\"Outputs\"], event[\"WorkflowExecutionId\"])\n",
      " File \"/var/task/app.py\", line 460, in complete_stage_execution\n raise ValueError(\n"
    ]
  }
}
```

The error is raised at the end of the translation. GraphInspector shows the stage "Complete Stage Translate" in red. The CloudWatch logs show the same error and don't provide additional info.

The marked comments in the code shortly before the ValueError is raised indicate that this might be a known issue (lines 437-460 in `/var/task/app.py`):

```
########### SEE THIS COMMENT:
# Start the next stage for execution
# FIXME - try always completing stage
# status == awsmie.STAGE_STATUS_COMPLETE:
############ KNOWN ISSUE?
workflow_execution = start_next_stage_execution(
    "Workflow", stage_name, workflow_execution)

if status == awsmie.STAGE_STATUS_ERROR:
    raise Exception("Stage {} encountered and error during execution, aborting the workflow".format(stage_name))

except Exception as e:
    logger.info("Exception {}".format(e))
    # Need a try/catch here? Try to save the status
    execution_table.put_item(Item=workflow_execution)
    update_workflow_execution_status(workflow_execution["Id"], awsmie.WORKFLOW_STATUS_ERROR,
                                     "Exception while rolling up stage status {}".format(e))
    logger.info("Exception {}".format(e))
    raise ValueError("Exception: '%s'" % e)
```

Do you have any idea how to solve this issue? Thank you very much in advance!

Best regards
0 answers · 0 votes · 3 views · asked 3 months ago

LightSail instance down every 2 or 3 days

I signed up for an AWS 1GB LightSail plan less than two months ago with WordPress installed, did a test run for a few weeks, and everything seemed fine. So I moved my WordPress website there about two weeks ago. Ever since then, the instance goes down every two or three days.

https://imgur.com/ewaGlJU

This graph shows my CPU usage and remaining burst capacity. The CPU usage stays inside the sustainable zone most of the time, the remaining burst capacity stays at 100%, and the dips are caused by rebooting. When the instance is down (the site is not accessible, and I can't SSH to it via PuTTY or the web console either), I have to stop and start the instance; the remaining capacity then drops to 20%, slowly climbs back to 100%, and stays there till the next cycle.

https://imgur.com/TnHTNun

This is the memory usage chart for the last 24 hours. During normal operation, memory usage is below 40%. The high spikes were caused by me connecting remotely via Visual Studio Code. When my site was down (reported downtime at 5:50am CST), memory usage dropped to 27% (at the end of the chart, around 12:00 UTC).

https://imgur.com/OVwRieu

This morning, after I received the site-down message (around 5:55am CST), I tried to connect to the instance. The remaining burst capacity stayed at 100%, CPU usage was around 1%, and memory usage was around 27%. Around 6:40am (CPU usage still 1%, burst capacity 100%, memory usage 27%), I gave up and stopped and started the instance. When the instance came back up around 6:45am, the CPU usage was around 2%, burst capacity had dropped to 20%, and memory usage jumped to 35%.

Can anyone help me figure out what's going on here, and what other ways there are to troubleshoot this problem? Thanks
1 answer · 0 votes · 33 views · asked 3 months ago