
Bug: ALBs, Lambdas, HTTPCode_ELB_5XX_Count


The crux: When a Lambda (reached via an ALB) successfully executes but returns a 5XX status code, the ALB counts this toward the HTTPCode_ELB_5XX_Count CloudWatch metric (in the AWS/ApplicationELB namespace). But this clearly should be counted in HTTPCode_Target_5XX_Count instead.

Background

We had a Lambda that was failing to catch an exception, and this was counted both in the Errors metric (in the AWS/Lambda namespace) and in HTTPCode_ELB_5XX_Count. This is all acceptable. We saw the errors in our dashboard for that Lambda and were able to explain the extra ELB 5XX rate in the ALB dashboard. [It would be nice if HTTPCode_ELB_5XX_Count had a dimension for TargetGroup since clearly the TargetGroup matters here, but that would be a feature request and I'd like to focus on the bug.]

We fixed the problem that was causing the exception and also fixed the Lambda to catch the exception and return the exact 5XX status code we wanted. But there was a second failure case that we didn't fix (its exception was caught, so the Lambda successfully returned a 5XX status code).

Our Lambda dashboard now reports 100% success rate but we still see an elevated rate in HTTPCode_ELB_5XX_Count. Since ELB_5XX has no dimension for TargetGroup, and Lambdas have no effective metrics for 5XX responses, we are left just searching around to figure out where the problem is.

Reasons To Fix

If successful 5XX responses from Lambdas were correctly recorded under HTTPCode_Target_5XX_Count, then we could add this metric (using the TargetGroup dimension) to the dashboards for our Lambdas. Then we would immediately know which logs to look at to diagnose the problem.

It would also prevent our ALB dashboard from appearing as if we had a backend problem that was preventing the ALB from being able to send requests to the appropriate, healthy target group, when this is clearly not actually what is happening. Since "no available target" is a serious condition that we actually page people for, we really don't want to be doing this based on something that is clearly not "no available target". (Most of the traffic to this ALB ends up routed to a backend running in EKS while only a few URLs are routed to Lambdas.)

1 Answer

Greeting

Hi Tye,

Thanks for sharing this detailed explanation of your experience! It’s clear you’ve been thorough in identifying where AWS metrics seem to misalign, and I can understand how this would create challenges in diagnosing backend issues and keeping your dashboards actionable. Let’s break this down and find a way to clarify and improve your monitoring setup. 😊


Clarifying the Issue

You’ve observed that HTTPCode_ELB_5XX_Count includes 5XX status codes returned by Lambda functions when invoked through an ALB. This creates ambiguity in your dashboards, as these successful Lambda responses appear as if the ALB itself is experiencing backend availability problems. Even though the Lambda successfully executes and returns a 5XX response, it’s still lumped into the ELB metric, making it difficult to trace and diagnose issues effectively.

AWS’s design logic behind this behavior is that the ALB essentially acts as a proxy. Any response returned to the ALB (whether from a backend or directly generated by the ALB itself) contributes to HTTPCode_ELB_5XX_Count. Unfortunately, this doesn’t distinguish between backend-originated errors and ALB-specific issues, creating confusion for users.

Your goal is to have these successful 5XX Lambda responses logged under HTTPCode_Target_5XX_Count, which would provide greater clarity and allow you to build more specific dashboards to monitor and address Lambda-specific errors. This is an excellent goal, as it aligns with the principle of actionable monitoring and reduces noise in your metrics!


Key Terms

  • ALB (Application Load Balancer): AWS's load balancer that distributes traffic across targets such as Lambda functions, EC2 instances, and containers.
  • HTTPCode_ELB_5XX_Count: A CloudWatch metric recording 5XX errors generated by the ALB itself or forwarded from backend responses.
  • HTTPCode_Target_5XX_Count: A CloudWatch metric for 5XX errors generated by backend targets, such as Lambda functions or EC2 instances, allowing isolation of target-specific issues.
  • TargetGroup Dimension: A filter for metrics to isolate traffic directed to a specific target group behind an ALB.

The Solution (Our Recipe)

Steps at a Glance:

  1. Configure custom logging for your Lambda function to explicitly record 5XX responses.
  2. Use structured logging to push these events to CloudWatch Logs.
  3. Create a custom metric filter in CloudWatch to capture Lambda 5XX responses.
  4. Add the custom metric to your Lambda-specific dashboards for monitoring.
  5. Monitor and manage costs associated with custom metrics and X-Ray.

Step-by-Step Guide:

  1. Configure Custom Logging for Your Lambda Function
    Update your Lambda function code to explicitly log a structured message whenever a 5XX response is returned. This will help you trace and attribute these responses accurately.

    import json
    
    def lambda_handler(event, context):
        try:
            # Example business logic
            raise Exception("Simulated backend error")
        except Exception as e:
            # Log and return a 5XX status
            log_message = {
                "statusCode": 502,
                "error": str(e),
                "message": "Lambda caught an exception and returned 5XX"
            }
            print(json.dumps(log_message))  # Log structured data for CloudWatch Metric Filters
            return {
                "statusCode": 502,
                "body": json.dumps({"error": "Backend error"})
            }

  2. Use Structured Logging to Push Events to CloudWatch Logs
    Ensure that the structured logs are being sent to CloudWatch. If you are using a default Lambda setup, these logs will already appear in your Lambda's CloudWatch Logs group.

  3. Create a Custom Metric Filter in CloudWatch
    In the AWS Management Console:

    • Navigate to CloudWatch > Log Groups and select your Lambda’s log group.
    • Create a metric filter with a pattern that matches 5XX responses. Example:
      { $.statusCode = 502 }
      
    • Assign this filter to a custom metric, such as AppName_TargetGroupName_5XX_Count.

    Pro Tip: Use consistent naming conventions like AppName_TargetGroupName_5XX_Count to maintain clarity across multiple applications or target groups.
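    The console steps above can also be sketched programmatically. Below is a minimal sketch that builds the request payload for the CloudWatch Logs `put_metric_filter` API; the log group, metric name, and namespace are hypothetical placeholders, and the actual boto3 call (which requires AWS credentials) is left as a comment.

    ```python
    def build_metric_filter_request(log_group, metric_name, namespace="Custom/Lambda"):
        """Build a put_metric_filter payload for counting Lambda 5XX log lines."""
        return {
            "logGroupName": log_group,
            "filterName": f"{metric_name}-filter",
            # Matches the structured log line emitted by the Lambda handler above
            "filterPattern": "{ $.statusCode = 502 }",
            "metricTransformations": [
                {
                    "metricName": metric_name,
                    "metricNamespace": namespace,
                    "metricValue": "1",   # count one per matching log line
                    "defaultValue": 0.0,  # emit 0 when nothing matches
                }
            ],
        }

    request = build_metric_filter_request(
        "/aws/lambda/my-function",            # hypothetical log group
        "AppName_TargetGroupName_5XX_Count",
    )
    # To apply it (requires credentials):
    #   import boto3
    #   boto3.client("logs").put_metric_filter(**request)
    ```

    Keeping the payload construction separate from the API call makes the filter definition easy to review and unit-test before deploying it.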


  4. Add the Custom Metric to Your Dashboards
    • Navigate to CloudWatch > Dashboards and add the custom metric.
    • Filter the metric by Lambda function name or other dimensions to isolate the data for your analysis.
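    As a sketch of this step, the dashboard widget can be defined as a JSON body and pushed with the CloudWatch `put_dashboard` API; the metric name, namespace, and region below are hypothetical placeholders, and the boto3 call itself is left as a comment.

    ```python
    import json

    def build_dashboard_body(metric_name, namespace="Custom/Lambda", region="us-east-1"):
        """Build a DashboardBody JSON string with one widget graphing the custom metric."""
        body = {
            "widgets": [
                {
                    "type": "metric",
                    "x": 0, "y": 0, "width": 12, "height": 6,
                    "properties": {
                        "metrics": [[namespace, metric_name]],
                        "stat": "Sum",    # count of 5XX responses per period
                        "period": 300,
                        "region": region,
                        "title": "Lambda 5XX responses",
                    },
                }
            ]
        }
        return json.dumps(body)

    dashboard_body = build_dashboard_body("AppName_TargetGroupName_5XX_Count")
    # To publish (requires credentials):
    #   import boto3
    #   boto3.client("cloudwatch").put_dashboard(
    #       DashboardName="lambda-5xx", DashboardBody=dashboard_body)
    ```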

  5. Monitor and Manage Costs
    • Custom Metrics: Be mindful that custom metrics incur additional charges. For high-throughput applications, consider using filters to track only critical errors or aggregate metrics to reduce costs.
    • AWS X-Ray: If using X-Ray, costs can grow with heavy traffic or complex tracing setups. Review the X-Ray pricing guide and fine-tune sampling rates to balance costs with visibility.

Closing Thoughts

By implementing custom metrics for Lambda 5XX responses, you gain greater clarity into backend issues without relying solely on ALB metrics like HTTPCode_ELB_5XX_Count. This approach complements AWS’s existing monitoring tools and helps you build actionable dashboards tailored to your architecture.


Farewell

I hope this clears up the ambiguity in your metrics, Tye! I can see how this is particularly important with your mixed Lambda and EKS backends behind the same ALB. Let me know if you have any other questions or need help setting up these custom metrics. Wishing you success with your monitoring dashboards! 🚀😊


Cheers,

Aaron! 😊

answered a year ago
  • Thank you very much for the detailed response.

    One comment. You wrote:

    | Any response returned to the ALB (whether from a backend or directly generated by the ALB itself) contributes to HTTPCode_ELB_5XX_Count.

This is true for Lambda backends but not for others. It would be good for Lambda backends to use HTTPCode_Target_5XX_Count as the others do. This consistency would spare AWS users from having to build elaborate work-arounds. So I still find it hard to see this Lambda-only behavior as anything other than a bug.

My work-around was not quite as elaborate as yours. I summed the AWS/ApplicationELB LambdaUserError metric over all TargetGroups and subtracted that from the HTTPCode_ELB_5XX_Count metric in my dashboards and alarms. For dashboards, you can use SUM(SEARCH('{AWS/ApplicationELB,LoadBalancer,TargetGroup} MetricName="LambdaUserError" LoadBalancer="$lb"', 'Sum')). For alarms, you have to use: SELECT SUM(LambdaUserError) FROM "AWS/ApplicationELB" WHERE LoadBalancer = '$lb' GROUP BY LoadBalancer (which means the graph of the alarm's history can be inaccurate as recently as 30m ago).

    Granted, wanting to get the other HTTPCode_Target_[234]XX_Count metrics for your Lambdas may lead to an approach similar to what you outlined anyway. We will likely get those using Prometheus instead as the cost management is unlikely to become an issue that way.

    So it would also be great for AWS customers if the ALB recorded 2XX, 3XX, and 4XX from Lambdas in these metrics like other backends.
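    The subtraction work-around above can be sketched as a dashboard widget's metrics array, built in Python. This is a minimal sketch, assuming the metric-math IDs (m1, e1) and the load balancer name are hypothetical placeholders; the SEARCH expression mirrors the one quoted in the comment.

    ```python
    def build_workaround_metrics(lb):
        """Return the 'metrics' array for a CloudWatch dashboard widget that
        graphs ELB 5XX minus the summed LambdaUserError across target groups."""
        search = (
            "SUM(SEARCH('{AWS/ApplicationELB,LoadBalancer,TargetGroup} "
            f"MetricName=\"LambdaUserError\" LoadBalancer=\"{lb}\"', 'Sum'))"
        )
        return [
            # Visible series: non-Lambda 5XX = total ELB 5XX minus Lambda user errors
            [{"expression": "m1 - e1", "id": "adjusted", "label": "non-Lambda ELB 5XX"}],
            # m1: total ELB 5XX for this load balancer (hidden input)
            ["AWS/ApplicationELB", "HTTPCode_ELB_5XX_Count",
             "LoadBalancer", lb, {"id": "m1", "stat": "Sum", "visible": False}],
            # e1: LambdaUserError summed over all target groups (hidden input)
            [{"expression": search, "id": "e1", "visible": False}],
        ]

    metrics = build_workaround_metrics("app/my-alb/0123456789abcdef")  # hypothetical LB name
    ```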

  • Hi Tye,

    Thanks for the follow-up! You’re absolutely right—recording Lambda 5XX responses under HTTPCode_Target_5XX_Count, as with other backends, would create much-needed consistency and simplify monitoring. It’s hard not to see this as a bug or at least a design quirk.

    Your workaround using LambdaUserError metrics is both clever and practical—thanks for sharing the exact syntax for dashboards and alarms! Prometheus also sounds like a strong option for detailed metrics without significant cost concerns.

    Extending ALB metrics to include HTTPCode_Target_2XX_Count, 3XX, and 4XX for Lambdas would align things further and reduce the need for such workarounds. Raising this through AWS feedback channels might encourage action if enough users highlight the issue.

    Let me know if you need help with Prometheus or refining your setup! 😊

    Best regards,
    Aaron 🚀
