
Request for Assistance in Understanding and Managing 700,000 AWS CloudWatch Metrics


I am reaching out for assistance regarding a large volume of metrics in our AWS CloudWatch account, totaling approximately 700,000 metrics. We are currently unsure of the origin of many of these metrics and would greatly appreciate guidance in identifying their sources and managing this volume effectively.

Could you please assist with the following:

Source Identification: We need assistance in understanding where these metrics are coming from. Is there a way to analyze or break down the sources of these metrics to help us identify which services, instances, or processes are generating them?

Data Retention and Archiving: Once we have a better understanding of the source, are there best practices for adjusting data retention policies or archiving metrics to help manage storage and costs effectively?

Cost Optimization: Given the unexpectedly high volume, any advice on cost optimization for CloudWatch would be invaluable. Are there specific steps or configurations that could help reduce costs associated with this number of metrics?

Alternative Storage Solutions: If we determine that many of these metrics are non-essential, are there alternative AWS services or approaches to store and analyze this data more cost-effectively, particularly for metrics that do not require real-time access?

Any guidance, documentation, or support in understanding and managing these metrics would be greatly appreciated.

3 Answers

Normally, the names and namespaces of the metrics are the best indication of their sources. Metrics produced by AWS's standard services follow a systematic naming convention. Custom metrics might be named in whichever way, but in most cases, when the effort is made to produce custom metrics, due consideration is given to making them useful by structuring them with proper naming.

If you browse the namespace structure and the individual metrics underneath them simply in the CloudWatch metrics console, is there a small number of namespaces under which the bulk of the 700,000 metrics are located? Or is there a huge number of namespaces? If there is only a handful of namespaces, do their names reveal anything about their source and purpose? If the number of namespaces is large, is there a pattern to them, such as essentially the same namespace being duplicated for a large number of servers or applications?
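To make that namespace breakdown concrete, here is a minimal sketch. The tally function works on metric records shaped like the output of the CloudWatch `ListMetrics` API; the commented-out boto3 usage is a hypothetical way to feed it from a real account (it assumes configured AWS credentials and is not required to run the sketch):

```python
from collections import Counter

def count_by_namespace(metrics):
    """Tally metric records (as returned by ListMetrics) per namespace."""
    return Counter(m["Namespace"] for m in metrics)

# Hypothetical usage against a real account (requires AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch")
# paginator = cw.get_paginator("list_metrics")
# all_metrics = (m for page in paginator.paginate() for m in page["Metrics"])
# print(count_by_namespace(all_metrics).most_common(10))

# Small illustrative sample in the ListMetrics response shape:
sample = [
    {"Namespace": "AWS/EC2", "MetricName": "CPUUtilization"},
    {"Namespace": "AWS/EC2", "MetricName": "NetworkIn"},
    {"Namespace": "MyApp/Sessions", "MetricName": "Errors4xx"},
]
print(count_by_namespace(sample).most_common())
```

If one or two namespaces dominate the top of that list, that is usually where to start digging.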

EXPERT
answered a month ago
EXPERT
reviewed a month ago

Hi, building on what the experts already shared, I'd like to point you to a resource that may help you analyze your metrics, and to give a few more specific answers to your questions.

Source identification: as pointed out above, your metrics are grouped by namespace, and you may find that a specific namespace holds a high concentration of metrics. This can happen, for example, when a dimension has a very large number of values, since every unique combination of dimension values creates a distinct metric. If that is your case, or if you need a way to quickly drill down into a huge block of metrics in your account, have you seen this blog post? https://aws.amazon.com/blogs/mt/analyzing-your-custom-metrics-spend-contributors-in-amazon-cloudwatch/ - the solution in that post is designed specifically to surface the biggest contributors to your metric count.

Data retention: metric retention is fixed at 15 months and cannot be changed. A metric disappears from the console list after no data has been sent to it for two weeks, but its data remains queryable through the API, or by entering the metric name manually in the source of a dashboard widget. Note that retention itself incurs no additional cost; you only need archiving if you want to keep metric data for more than 15 months.

Cost: you incur spend only when you ingest metric values, so the way to reduce costs is to identify the source and understand what is driving the high metric count. If the driver is a dimension with too many values, and you don't need that level of granularity, change the way the metric is emitted to reduce the number of dimensions. For example, if you emit a metric with the 4xx error count per API endpoint and per session, the session id would have a huge number of values, creating a huge number of metrics. If all you really need is the 4xx error count per API endpoint, removing the session id from the dimensions when you send the metric data would greatly reduce the number of metrics and drive down costs. Similarly, if you are creating a metric via a metric filter with a dimension whose values come from the logs, and that dimension has too many values, you could either use a less verbose dimension or create a Contributor Insights rule to analyze the data instead of creating a metric. Those are just two examples; I hope you find them useful, along with the blog post. There's also a dedicated documentation page at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_billing.html
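The dimension-cardinality effect above is just multiplication, but it is easy to underestimate. A quick sketch with hypothetical numbers (20 endpoints, 5,000 session ids - both made up for illustration):

```python
# Each unique combination of dimension values is billed as a separate metric.
endpoints = [f"/api/v1/resource{i}" for i in range(20)]   # 20 hypothetical endpoints
sessions = [f"sess-{i}" for i in range(5000)]             # 5,000 hypothetical session ids

# With (Endpoint, SessionId) dimensions: one metric per combination.
with_session = len(endpoints) * len(sessions)

# With only the Endpoint dimension: one metric per endpoint.
without_session = len(endpoints)

print(with_session)     # 100000
print(without_session)  # 20
```

Dropping one high-cardinality dimension takes this hypothetical account from 100,000 metrics to 20, which is exactly the kind of reduction that can explain (and fix) a 700,000-metric bill.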

AWS
answered a month ago
EXPERT
reviewed 15 days ago

Certainly, I can help you address the challenges you're facing with your large volume of AWS CloudWatch metrics. Let's break this down into the areas you've mentioned:

  1. Source Identification: To identify the sources of your 700,000 metrics, you can use AWS Cost Explorer and AWS Cost and Usage Reports integrated with Amazon Athena. These tools can help you analyze your CloudWatch usage data and break down the metrics by service, instance, or process. You can also use the CloudWatch console to view metrics by namespace, which can give you insights into which AWS services are generating the most metrics.

  2. Data Retention and Archiving: CloudWatch metric retention is fixed at 15 months and cannot be shortened or extended, so retention settings are not a cost lever here. For metrics that need to be kept for compliance or long-term analysis beyond that window, consider exporting the data (for example, via CloudWatch Metric Streams) to Amazon S3 for more cost-effective long-term storage.

  3. Cost Optimization: To optimize costs with such a high volume of metrics:

  • Remove unused or unnecessary metrics and alarms.
  • Use the CloudWatch embedded metric format instead of calling the PutMetricData API directly, which can be more cost-effective for high-cardinality data.
  • Batch your data when sending metrics to CloudWatch.
  • Avoid using high-cardinality dimensions in your metrics.
  • Consider using metric streams to send only necessary metrics to other AWS services or external monitoring tools.
  4. Alternative Storage Solutions: For non-essential metrics or those that don't require real-time access:
  • Use the CloudWatch embedded metric format to ingest high-cardinality application data as logs. This allows you to generate actionable metrics while reducing costs associated with the PutMetricData API.
  • Consider using Amazon S3 for long-term storage of historical metric data. You can use AWS Glue and Amazon Athena to query and analyze this data when needed.
  • For custom-developed applications, work with your developers to implement important metrics using the CloudWatch embedded metric format, which can help reduce costs and simplify metric collection, especially for Lambda functions and containers.
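To illustrate the embedded metric format (EMF) suggestion above: an EMF record is a structured JSON log line with an `_aws` block declaring which fields are metrics and dimensions; everything else is stored as a queryable log field only, so high-cardinality values (like a session id) add no metrics. A minimal sketch, with a hypothetical helper and hypothetical field names:

```python
import json
import time

def emf_record(namespace, metric_name, value, unit="Count",
               dimensions=None, **properties):
    """Build one CloudWatch Embedded Metric Format (EMF) log line.

    Keys in `dimensions` become metric dimensions (keep these low-cardinality);
    extra `properties` are stored as log fields only, so high-cardinality
    values such as a session id do not create additional metrics.
    """
    dimensions = dimensions or {}
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
        **dimensions,
        **properties,
    }
    return json.dumps(record)

# Hypothetical example: count 4xx errors per endpoint as a metric,
# while keeping the session id available only as a log field.
line = emf_record("MyApp/API", "Errors4xx", 1,
                  dimensions={"Endpoint": "/orders"},
                  SessionId="sess-12345")
print(line)
```

Written to a log stream (e.g. from Lambda via stdout, or via the CloudWatch agent), lines like this generate the `Errors4xx` metric per endpoint without any `PutMetricData` calls, and the session id remains searchable in Logs Insights.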

Remember to regularly review and optimize your CloudWatch usage. Set up proper tagging for your resources to make it easier to identify and manage metrics associated with specific applications or teams. Also, consider using AWS Systems Manager for automating the deployment and management of the CloudWatch agent across your EC2 instances and on-premises servers.

By implementing these strategies, you should be able to better understand the sources of your metrics, manage your data more effectively, optimize costs, and find appropriate storage solutions for your CloudWatch metrics.
Sources
Analyzing, optimizing, and reducing CloudWatch costs - Amazon CloudWatch
Observability - AWS Cloud Adoption Framework: Operations Perspective
Application-level CloudWatch configuration - AWS Prescriptive Guidance
PERF02-BP03 Collect compute-related metrics - AWS Well-Architected Framework

profile picture
answered a month ago
  • Note: you can’t manually delete metrics, even though the AI-generated answer above suggests you can; metric data is retained for the fixed 15-month period.
