How does Cloudwatch latency metric get calculated/ Did I find a bug in X-Ray Traces?

0

We have this really odd request latency problem that may have to do with something infrastructurally on our egress servers located on premises, but we're not sure. Looking at x-ray traces, we'll see a large number of requests take 1-20 minutes to get satisfied. This show up on the cloudwatch x-ray trace duration column when looking at the table of traces. When I click on a particular trace associated with some super long, definitely-not-just-a-cold-start duration, the total duration billed duration on the actual handlers rarely exceeds 100ms.

I found Cloudwatch API Gateway latency metrics for the p99 request latencies that seem to correlate with the durations we are seeing in that trace id table. In order to propose a good theory based on the findings I've found so far I'd like to know a few things.

I am looking for a precise definition of what specifically triggers the start and stop times:

  1. How does the cloudwatch latency metric for API Gateway get calculated?
  2. How does the X-Ray trace table calculate duration for the duration of a lambda execution?
  3. How does the duration data get calculated for the view of a singular trace id? (I have conclusive evidence that this is different than the duration used in the X-Ray trace table view)
1 Answer
1

Thank you for reaching out. I believe you are observing discrepancy in cloudwatch latency metric in the x-ray trace data v/s the billed duration and you have queries below. Though the reason for the issue could vary, it would be early to comment on the cause or to mention it a bug. Please find the answers below and feel free to create a support case with premium support team, if needed.

How does the cloudwatch latency metric for API Gateway get calculated?

The time between when API Gateway receives a request from a client and when it returns a response to the client. The latency includes the integration latency and other API Gateway overhead. Unit: Millisecond [+] https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-metrics-and-dimensions.html

How does the X-Ray trace table calculate duration for the duration of a lambda execution?

To note here, If your Lambda function uses provisioned concurrency, your X-Ray trace might display a function initialization with a very long duration. [+] https://docs.aws.amazon.com/lambda/latest/dg/provisioned-concurrency.html [+] https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html

Lambda records a segment for the Lambda service that handles the invocation request, and one for the work done by the function. The function segment comes with subsegments for Initialization, Invocation and Overhead

The Initialization subsegment represents the init phase of the Lambda execution environment lifecycle. During this phase, Lambda creates or unfreezes an execution environment with the resources you have configured, downloads the function code and all layers, initializes extensions, initializes the runtime, and runs the function's initialization code.The Invocation subsegment represents the invoke phase where Lambda invokes the function handler. This begins with runtime and extension registration and it ends when the runtime is ready to send the response.

The Overhead subsegment represents the phase that occurs between the time when the runtime sends the response and the signal for the next invoke. During this time, the runtime finishes all tasks related to an invoke and prepares to freeze the sandbox.
[+] https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html#services-xray-api

How does the duration data get calculated for the view of a singular trace id? (I have conclusive evidence that this is different than the duration used in the X-Ray trace table view)

I believe you are referring the duration calculated for the trace at singular point in time. The same is calculated from the difference between start_time and end_time

start_time – number that is the time the segment was created, in floating point seconds in epoch time. For example, 1480615200.010 or 1.480615200010E9. Use as many decimal places as you need. Microsecond resolution is recommended when available.
end_time – number that is the time the segment was closed. For example, 1480615200.090 or 1.480615200090E9. Specify either an end_time or in_progress.

[+] https://docs.aws.amazon.com/xray/latest/devguide/xray-api-segmentdocuments.html#api-segmentdocuments-fields
AWS
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions