Optimizing CloudWatch Container Insights Costs: A Comprehensive Guide

9 minute read
Content level: Expert

This guide focuses specifically on optimizing Container Insights application logs costs, which represent one component of your EKS cluster's observability costs. While a typical EKS deployment includes multiple logging components, we'll concentrate on Container Insights application logs optimization techniques while providing references for optimizing other components.

Table of Contents

  • Introduction
  • Prerequisites
  • Understanding Your Container Insights and EKS Logging Costs
  • Scope and Architecture
  • Log Optimization Steps
  • Cost Impact Analysis
  • Monitoring and Maintenance
  • Best Practices and Recommendations
  • Conclusion
  • Next Steps

Introduction

Amazon CloudWatch Container Insights helps you monitor and troubleshoot containerized applications and microservices. However, without proper optimization, costs can grow significantly. In this post, we'll explore how to analyze and optimize Container Insights costs effectively using a practical example.

Prerequisites

  • An existing EKS cluster with Container Insights enabled
  • AWS CLI configured with appropriate permissions
  • Basic understanding of Kubernetes and CloudWatch
  • Access to modify Fluent Bit configurations

Understanding Your Container Insights and EKS Logging Costs

Before diving into optimization strategies, it's crucial to understand your current cost distribution across different observability components.

To analyze your costs, first follow the setup instructions in Using AWS Cost and Usage Reports with Athena to create and query your Cost and Usage Reports.

Once set up, use this query to break down your Container Insights and EKS logging costs by purpose:

SELECT
    line_item_resource_id AS ResourceID,
    line_item_operation AS Operation,
    CASE 
        WHEN line_item_resource_id LIKE '%/aws/eks/%/core-containers%' 
            THEN 'Control Plane Logs'
        WHEN line_item_resource_id LIKE '%/aws/eks/%/containers%' 
            THEN 'Platform Container Logs'
        WHEN line_item_resource_id LIKE '%/aws/eks/%/cluster%' 
            THEN 'Cluster Level Logs'
        WHEN line_item_resource_id LIKE '%/aws/containerinsights/%/performance%' 
            THEN 'Container Insights Performance Metrics'
        WHEN line_item_resource_id LIKE '%/aws/containerinsights/%/prometheus%' 
            THEN 'Prometheus Metrics'
        WHEN line_item_resource_id LIKE '%/aws/containerinsights/%/application%' 
            THEN 'Container Insights Application Logs'
        WHEN line_item_resource_id = ''
            THEN 'EMF Metrics Storage'
        ELSE 'Other'
    END AS Purpose,
    SUM(CAST(line_item_unblended_cost AS decimal(16,8))) AS TotalSpend
FROM
    costandusagereport
WHERE
    product_product_name = 'AmazonCloudWatch'
    AND line_item_usage_account_id = '123456789123'  -- Replace with your account ID
    AND line_item_operation IN (
        'MetricStorage:AWS/Logs-EMF', -- Embedded Metrics
        'PutLogEvents',               -- Logs Ingestion
        'HourlyStorageMetering'       -- Logs Storage
    )
    AND line_item_line_item_type NOT IN ('Tax','Credit','Refund','EdpDiscount','Fee','RIFee')
    AND (
        line_item_resource_id LIKE '%log-group:/aws/containerinsights%' 
        OR line_item_resource_id LIKE '%log-group:/aws/eks%' 
        OR line_item_resource_id =''
    )
GROUP BY
    line_item_resource_id,
    line_item_operation
ORDER BY
    TotalSpend DESC

Example results showing typical cost patterns:

ResourceID                                        Operation                   Purpose               TotalSpend
/aws/containerinsights/cluster-prod/application   PutLogEvents                Application Logs          450.36
(none)                                            MetricStorage:AWS/Logs-EMF  Performance Metrics       266.70
/aws/eks/cluster-prod/containers                  PutLogEvents                Platform Logs             131.49
/aws/containerinsights/cluster-prod/prometheus    PutLogEvents                Prometheus Metrics         98.75
/aws/eks/cluster-prod/core-containers             PutLogEvents                Control Plane Logs         45.60

Scope and Architecture

Kubernetes Cluster
├── CloudWatch Agent DaemonSet
│   ├── Performance Metrics
│   │   └── /aws/containerinsights/*/performance
│   │       (EMF metrics - essential monitoring, no optimization) [Article: No]
│   │
│   └── Prometheus Metrics
│       └── /aws/containerinsights/*/prometheus
│           (Custom metrics - review collection settings) [Article: No]
│
└── Fluent Bit DaemonSet
    ├── Container Insights Logs
    │   └── /aws/containerinsights/*/application
    │       (Primary optimization target - filtering, sampling) [Article: Yes]
    │
    └── EKS Logs
        ├── Platform Logs
        │   └── /aws/eks/*/containers
        │       (Highest cost among the EKS log groups - consider selective logging) [Article: No]
        │
        ├── Control Plane Logs
        │   └── /aws/eks/*/core-containers
        │       (Critical system logs - keep full logging) [Article: No]
        │
        └── Cluster Logs
            └── /aws/eks/*/cluster
                (Cluster-level events - keep full logging) [Article: No]

Based on the cost analysis from our Athena query, you can see various components contributing to your EKS observability costs. This guide focuses specifically on optimizing Container Insights application logs (marked as [Article: Yes] above).

For other components, including EKS platform and control plane logs optimization, see the observability cost optimization section of the Amazon EKS Best Practices Guide.

Our optimization techniques will focus on the Fluent Bit configuration for Container Insights application logs, where we can achieve significant cost savings (up to 96.5% log volume reduction) while maintaining observability.

Log Optimization Steps

Now that we understand which components we're targeting, let's explore four progressive optimization steps for Container Insights application logs using Fluent Bit configuration:

  1. Configure Log Filtering
  2. Implement Log Level Filtering
  3. Implement Log Sampling
  4. Optimize Batch Processing

Step 1: Configure Log Filtering

Purpose: Exclude logs from specific namespaces to reduce log volume.

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        K8S-Logging.Exclude On

    # Drop logs from system and monitoring namespaces, using the
    # metadata added by the kubernetes filter above
    [FILTER]
        Name    grep
        Match   kube.*
        Exclude $kubernetes['namespace_name'] ^(kube-system|monitoring|ingress-nginx)$

Impact: Reduces log volume by excluding system and monitoring namespaces

  • Before: 50GB/month
  • After: 35GB/month (-30%)
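
To decide which namespaces are worth excluding (or to verify the effect of an exclusion), you can rank namespaces by log volume with a CloudWatch Logs Insights query. Below is a minimal sketch using the AWS CLI; the cluster name comes from the earlier example, the field name assumes the default Kubernetes metadata attached by Fluent Bit, and GNU date syntax is shown for the time range:

# Ask Logs Insights which namespaces produced the most log events over the last 7 days
aws logs start-query \
    --log-group-name /aws/containerinsights/cluster-prod/application \
    --start-time "$(date -d '7 days ago' +%s)" \
    --end-time "$(date +%s)" \
    --query-string 'stats count(*) as events by kubernetes.namespace_name | sort events desc'

# Retrieve the results with the queryId returned by the command above:
# aws logs get-query-results --query-id <queryId>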

Step 2: Implement Log Level Filtering

Purpose: Remove less critical log entries (INFO and DEBUG) to focus on important logs. Add this filter to your existing configuration:

[FILTER]
    Name    grep
    Match   kube.*
    # This will exclude logs containing INFO or DEBUG
    Exclude log INFO|DEBUG

Impact: Reduces remaining log volume by filtering out INFO and DEBUG logs

  • Before: 35GB/month
  • After: 17.5GB/month (-50%)

Step 3: Implement Log Sampling

Purpose: Sample a percentage of the remaining logs to further reduce volume while maintaining representative data.

Add this filter to your existing configuration:

[FILTER]
    Name    lua
    Match   kube.*
    call    sample
    # Keep roughly 10% of records and drop the rest
    code    function sample(tag, ts, record) if math.random() <= 0.10 then return 0, ts, record else return -1, ts, record end end

Impact: Reduces log volume by sampling only 10% of logs

  • Before: 17.5GB/month
  • After: 1.75GB/month (-90%)
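
To verify that the sampling and filtering stages are actually discarding records before they reach CloudWatch, you can read Fluent Bit's built-in metrics endpoint. This sketch assumes the standard Container Insights deployment (pods labelled k8s-app=fluent-bit in the amazon-cloudwatch namespace) with the HTTP server enabled on port 2020; adjust the names to your setup:

# Pick one Fluent Bit pod and forward its monitoring port
POD=$(kubectl -n amazon-cloudwatch get pods -l k8s-app=fluent-bit \
    -o jsonpath='{.items[0].metadata.name}')
kubectl -n amazon-cloudwatch port-forward "$POD" 2020:2020 &

# The drop_records counters show how many records each filter has discarded
curl -s http://localhost:2020/api/v1/metrics | python3 -m json.tool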

Step 4: Optimize Batch Processing

Purpose: Tune how often buffered logs are flushed to CloudWatch and how long they are retained, reducing PutLogEvents API calls and storage costs. Adjust the flush interval in the [SERVICE] section and set a retention policy on the CloudWatch output:

[SERVICE]
    # Flush buffered records less frequently (seconds) so each
    # PutLogEvents call carries a larger batch of log events
    Flush               30

[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              ${AWS_REGION}
    log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
    log_stream_prefix   application-
    auto_create_group   true
    # Expire stored logs automatically (the default is to keep them forever)
    log_retention_days  14

Impact: While this step doesn't directly reduce log volume, it:

  • Reduces API calls by batching more logs together
  • Manages storage costs through retention policies (see the CLI example below)
  • Improves overall performance
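
If the application log group already exists, the same retention policy can also be applied (or adjusted later) directly with the AWS CLI; the log group name below is the one from our example cluster:

aws logs put-retention-policy \
    --log-group-name /aws/containerinsights/cluster-prod/application \
    --retention-in-days 14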

Complete configuration example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush               30

    [FILTER]
        Name                kubernetes
        Match               kube.*
        K8S-Logging.Exclude On

    [FILTER]
        Name    grep
        Match   kube.*
        Exclude $kubernetes['namespace_name'] ^(kube-system|monitoring|ingress-nginx)$

    [FILTER]
        Name    grep
        Match   kube.*
        Exclude log INFO|DEBUG

    [FILTER]
        Name    lua
        Match   kube.*
        call    sample
        code    function sample(tag, ts, record) if math.random() <= 0.10 then return 0, ts, record else return -1, ts, record end end

    [OUTPUT]
        Name                cloudwatch_logs
        Match               kube.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_prefix   application-
        auto_create_group   true
        log_retention_days  14
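
After updating the ConfigMap, restart Fluent Bit so the pods pick up the new configuration. The commands below assume the manifest above was saved as fluent-bit-config.yaml and the standard Container Insights deployment (DaemonSet fluent-bit in the amazon-cloudwatch namespace):

kubectl apply -n amazon-cloudwatch -f fluent-bit-config.yaml
kubectl -n amazon-cloudwatch rollout restart daemonset/fluent-bit
kubectl -n amazon-cloudwatch rollout status daemonset/fluent-bit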

Cost Impact Analysis

Let's analyze the impact of these optimizations:

Cost Reduction Journey

Initial State ($450.36 - Application Logs PutLogEvents)
│
├── Step 1: Namespace Exclusion
│   └── Logs: -30% (-$135.11/month)
│   └── New cost: $315.25
│
├── Step 2: Log Level Filtering
│   └── Logs: -50% (-$157.62/month)
│   └── New cost: $157.63
│
├── Step 3: Sampling (10%)
│   └── Logs: -90% (-$141.87/month)
│   └── New cost: $15.76
│
└── Step 4: Batch Processing
    └── Reduced API costs through batching

Final Cost: $15.76/month for application logs
EMF Metrics: $266.70/month (unchanged)

Total Monthly Savings on Application Logs: $434.60 (96.5% reduction)

Note: EMF Metrics costs ($266.70/month) remain unchanged in our optimization scenario because these metrics are extracted from the /aws/containerinsights/*/performance logs, which are not the target of our optimization steps.

Monitoring and Maintenance

After implementing these optimizations, it's crucial to monitor both cost effectiveness and system observability.

Cost Monitoring

Monitor these CloudWatch metrics:

  • IncomingBytes and IncomingLogEvents for log volume (an example query follows this list)
  • ResourceCount for Container Insights metrics
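
For example, the following CLI call pulls the daily ingested volume (in bytes) for the application log group from our example; GNU date syntax is shown for the time range:

aws cloudwatch get-metric-statistics \
    --namespace AWS/Logs \
    --metric-name IncomingBytes \
    --dimensions Name=LogGroupName,Value=/aws/containerinsights/cluster-prod/application \
    --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 86400 \
    --statistics Sum \
    --output table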

Observability Monitoring

To ensure optimization hasn't impacted your observability, monitor these aspects:

  1. Critical Event Detection
    1. Monitor incident detection time
    2. Track error visibility
    3. Verify critical error logging
  2. Application Health Monitoring
    1. Application performance metrics
    2. Service-level indicators (SLIs)
    3. Business event tracking
  3. Infrastructure Visibility
    1. Container restart monitoring
    2. Node health checks
    3. Service availability metrics

Example alert for monitoring logging effectiveness:

# Example: alarm if ingestion into the application log group drops
# unexpectedly (which might indicate broken log collection)
aws cloudwatch put-metric-alarm \
    --alarm-name application-logs-missing \
    --metric-name IncomingLogEvents \
    --namespace AWS/Logs \
    --dimensions Name=LogGroupName,Value=/aws/containerinsights/cluster-prod/application \
    --statistic Sum \
    --period 3600 \
    --threshold 10 \
    --comparison-operator LessThanThreshold \
    --evaluation-periods 1 \
    --treat-missing-data breaching \
    --alarm-actions ${SNS_TOPIC_ARN}

Best Practices and Recommendations

1. Implementation Strategy

  • Start with Non-Production
    • Test configurations in development environments first
    • Validate impact on troubleshooting capabilities
    • Document baseline metrics before changes
  • Gradual Implementation
    • Follow steps 1-4 in sequence
    • Allow time between changes to assess impact
    • Keep team informed of changes and expectations
  • Documentation
    • Record excluded namespaces and reasoning
    • Document logging levels for each application
    • Maintain change history and impact assessments

2. Ongoing Optimization

Regular Monitoring

Weekly Tasks:
├── Review error rates in sampled logs
├── Check for missing critical events
└── Validate batch processing performance

Monthly Tasks:
├── Analyze cost trends
├── Review namespace exclusions
└── Adjust sampling rates if needed

Quarterly Tasks:
├── Full cost-benefit analysis
├── Update retention policies
└── Review overall observability effectiveness

3. Observability Balance

  • Critical Systems
    • Keep ERROR and WARN logs unsampled
    • Maintain full logging for security events
    • Consider separate log groups for critical components
  • Environment-Specific Settings
Production:
├── Conservative sampling (25-50%)
├── Retain all error logs
└── Full metrics collection

Staging:
├── Moderate sampling (10-25%)
├── Basic error logging
└── Selected metrics

Development:
├── Aggressive sampling (5-10%)
├── Minimal logging
└── Limited metrics

Storage Class Considerations

  • Container Insights components (logs and EMF metrics) require the Standard log group class
  • The Infrequent Access (IA) log class is not supported for Container Insights; you can verify the class of your log groups with the check below
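
To confirm which class (and retention period) your Container Insights log groups currently use, list them with the CLI:

aws logs describe-log-groups \
    --log-group-name-prefix /aws/containerinsights/ \
    --query 'logGroups[].{Name:logGroupName,Class:logGroupClass,Retention:retentionInDays}' \
    --output table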

Conclusion

Our systematic approach to Container Insights optimization demonstrated significant cost savings while maintaining observability:

Key Achievements

Cost Reduction:
├── Total savings: $434.60/month (96.5%)
├── Application logs cost: $450.36 → $15.76
└── Log volume reduction: 96.5%

Maintained Capabilities:
├── Critical error detection
├── Performance monitoring
└── System troubleshooting

While performance metrics constitute a significant cost ($266.70/month), our log optimization strategy provided substantial savings with minimal operational impact. The key is finding the right balance between cost optimization and maintaining effective system observability.

Next Steps

  1. Implement monitoring dashboards
  2. Establish regular review cycles
  3. Document optimization results
  4. Plan for continuous improvement

