Value of TargetResponseTime for AWS ECS Service Connect Service seems incorrect

Hello, I have two ECS services connected to each other through ECS Service Connect. Users make requests to service A, and service A makes requests to service B. Service B takes around 16 seconds to process a request, which I can verify both in B's application logs and in the request time measured by A. However, the TargetResponseTime metric with the TargetDiscoveryName dimension for service B always shows 30000ms, which is simply not correct. Am I doing something wrong? Am I missing some configuration? Or is this expected, and the metric should not be relied on?

3 Answers
1. Check ECS Service Connect Configuration

Ensure that your ECS Service Connect configuration is correctly set up. Double-check the following:

Service Discovery: Verify that service discovery is properly configured and that service A can correctly resolve service B.

Service Connect Definition: Ensure that the services are correctly defined and connected in your Service Connect configuration.

2. Validate Metric Collection

Sometimes, discrepancies in metrics might be due to a delay or issue in metric collection. Make sure:

CloudWatch Metrics: Look for any anomalies or gaps in CloudWatch metrics that might indicate a problem with metric collection.

Metric Aggregation: Ensure that the metrics are not being aggregated in a way that could cause confusion (e.g., if you’re aggregating over longer time periods).
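To rule out aggregation effects, the raw statistics can be pulled per period and compared side by side. The following is a minimal sketch using boto3's `get_metric_statistics`. The `AWS/ECS` namespace and the single `TargetDiscoveryName` dimension are assumptions taken from how the question describes the metric, and `service-b` is a hypothetical discovery name; adjust them to match what the CloudWatch console shows for your metric.

```python
from datetime import datetime, timedelta, timezone


def build_query(target_discovery_name, minutes=60, period=60):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ECS",  # assumption: namespace of the Service Connect metrics
        "MetricName": "TargetResponseTime",
        # Assumption: the metric is published with this single dimension.
        # Add ClusterName / ServiceName entries if yours is published with them.
        "Dimensions": [
            {"Name": "TargetDiscoveryName", "Value": target_discovery_name}
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Minimum", "Average", "Maximum", "SampleCount"],
    }


def fetch_datapoints(target_discovery_name):
    """Fetch datapoints sorted by time (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_query works without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**build_query(target_discovery_name))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Example call (hypothetical discovery name):
# for point in fetch_datapoints("service-b"):
#     print(point["Timestamp"], point["Minimum"], point["Average"], point["Maximum"])
```

If Minimum, Average, and Maximum are identical in every period even though request durations vary, aggregation over time is unlikely to be the cause of the discrepancy.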

3. Inspect Application and Load Balancer Logs

Check the application logs of both services and any load balancer logs that might be in use:

Service Logs: Verify the request and response times in the logs of service B to ensure they align with the 16-second processing time you expect.

Load Balancer Logs: If you're using a load balancer in front of your ECS services, check its logs to see if it provides additional insights into request handling times.
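For the load balancer comparison, the relevant access log field is `target_processing_time`. Below is a small sketch that extracts it from a single Application Load Balancer access log line. The sample line is made up for illustration and truncated after the user-agent field; real entries carry more trailing fields, which this parser ignores.

```python
import shlex

# Leading fields of an ALB access log entry, in order.
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request",
]


def parse_timings(line):
    """Return the request line and target processing time (seconds)."""
    values = shlex.split(line)  # honours the double-quoted fields
    entry = dict(zip(FIELDS, values))
    return {
        "request": entry["request"],
        "target_processing_time_s": float(entry["target_processing_time"]),
    }


# Made-up sample line, not taken from any real log.
SAMPLE = (
    "https 2024-01-01T00:00:00.000000Z app/example-alb/0123456789abcdef "
    "203.0.113.10:50000 10.0.0.5:8080 0.001 16.842 0.000 200 200 120 512 "
    '"GET https://example.com:443/report HTTP/1.1" "curl/8.0.0"'
)

timings = parse_timings(SAMPLE)
print(timings)
```

Running this over the entries that correspond to the slow requests gives a second, independent duration to set against the CloudWatch value.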

4. Review ECS Task Definitions and Health Checks

Ensure that:

Task Definitions: The task definitions for both services are correctly configured and up-to-date.

Health Checks: Verify that the health checks for your services are correctly configured and that they do not interfere with the response time metrics.

5. Cross-Verify with Other Metrics

Sometimes cross-verifying with other metrics or logs can provide additional insights:

AWS X-Ray: If you’re using AWS X-Ray, it can help trace requests across services and provide more granular timing information.

Custom Metrics: Consider adding custom metrics or logs to your application code to provide more precise timing information.
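As a rough illustration of the custom timing suggestion, the sketch below wraps a handler and logs how long each call takes. `handle_request` is a hypothetical stand-in for service B's real request handler, and the logger name is arbitrary.

```python
import functools
import logging
import time

logger = logging.getLogger("timing")


def timed(func):
    """Log the wall-clock duration of each call and keep the last value."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            wrapper.last_elapsed_ms = elapsed_ms
            logger.info("%s took %.1f ms", func.__name__, elapsed_ms)

    wrapper.last_elapsed_ms = None
    return wrapper


@timed
def handle_request():
    # Hypothetical handler: the sleep stands in for the real work.
    time.sleep(0.05)
    return "ok"
```

Durations logged this way are measured inside the application, so they exclude any time spent in the proxy or on the network.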

EXPERT
answered a year ago
    1. Check ECS Service Connect Configuration

    All requests from service A to service B succeed, their responses are as expected and service A runs completely correctly.

    2. Validate Metric Collection

    Every aggregation and statistic I apply to the TargetResponseTime metric shows 30000ms. I have a synthetic canary set up against service A which runs every 5 minutes. These are the requests that take ~16s, but the metric entries correlating with these requests say 30s.

    3. Inspect Application and Load Balancer Logs

    Application logs of services A and B show 16-17 seconds for these requests. The ALB in front of service A reports a target_processing_time of ~17 seconds in its access logs.

    4. Review ECS Task Definitions and Health Checks

    As far as I understand it, container health checks for service B are run directly against the container, not through the ECS Service Connect Proxy and as such should not in any way contribute to this metric.

    5. Cross-Verify with Other Metrics

    Unfortunately I am not using X-Ray for these services.

    One additional piece of information that might be relevant: service A also calls other services similar to service B. Their requests are much faster than service B's, but their TargetResponseTime metric also has curious values. They are always rounded to some multiple of 10 ms (50ms, 100ms, 250ms) and they never fluctuate; they remain constant even when inspecting the Minimum statistic over 1s.
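The pattern described here (a fixed 30000ms for requests of about 16s, and fixed 50ms, 100ms, or 250ms for the faster services) would be consistent with a metric derived from histogram buckets, where each sample is reported as the upper bound of the bucket it falls into. This is only a hypothesis: nothing in this thread or in the quoted documentation confirms it, and the bucket boundaries below are hypothetical values chosen so that the reported numbers appear among them.

```python
import bisect

# Hypothetical bucket upper bounds in milliseconds, for illustration only.
BUCKETS_MS = [50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000]


def reported_value(actual_ms):
    """Return the upper bound of the bucket that contains actual_ms."""
    index = bisect.bisect_left(BUCKETS_MS, actual_ms)
    if index == len(BUCKETS_MS):
        return BUCKETS_MS[-1]  # clamp anything beyond the last bucket
    return BUCKETS_MS[index]


for actual in (80, 180, 16000, 17000):
    print(actual, "->", reported_value(actual))
```

Under that assumption, any request between 10s and 30s would be reported as 30000ms, which is a question worth putting to AWS support alongside the observed values.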

  • Can you check below? I've posted another answer.

It sounds like you have thoroughly investigated several aspects of your ECS Service Connect setup and are still encountering issues with the TargetResponseTime metric being consistently reported as 30 seconds, despite other indicators suggesting a response time closer to 16 seconds.

Given your findings, let's consider a few additional avenues to explore:

1. Metric Collection and Reporting Issue

Since the TargetResponseTime metric consistently shows 30,000 ms regardless of actual request times and aggregation attempts, it's possible there might be an issue with how this metric is being reported or collected. Here are a few things you might consider:

CloudWatch Metric Data Delay: There could be a delay in metric reporting or aggregation. Although unusual, CloudWatch metrics sometimes experience delays. If you haven’t already, try waiting a bit longer and re-checking the metrics.

Metric Source: Ensure that the metric source is correctly configured and there are no misconfigurations causing incorrect reporting. Sometimes, discrepancies can arise from how the metric is aggregated or reported.

2. ECS Service Connect and Target Response Time

Review the ECS Service Connect documentation and configuration to ensure that the metric being observed is indeed the one intended. Service Connect might have specific nuances in how metrics are reported:

Service Connect Metrics Documentation: Check AWS documentation for any known issues or peculiarities with TargetResponseTime metrics for ECS Service Connect. There might be notes or known issues that could explain the behavior you're seeing.

Service Configuration: Re-check the configuration of the Service Connect proxy. Sometimes, incorrect configurations can lead to unexpected behavior in metrics.

3. Metric Interpretation and Calibration

There might be an issue with how the metric is interpreted:

Metric Calibration: If possible, attempt to calibrate or validate the metric by running controlled tests or synthetic workloads to see if the metric behavior aligns with expectations.

Compare with Other Metrics: Although TargetResponseTime might be unreliable, look at other related metrics (like RequestLatency or TargetProcessingTime) for consistency. If these metrics show correct values, it might suggest an issue specific to TargetResponseTime.

EXPERT
answered a year ago
    1. Metric Collection and Reporting Issue

    Service B has been running for almost a day and the metric has not changed. The other services I mentioned have been running for a few months and their metrics have not changed either.

    I'm not sure what you mean by metric source. The metric is reported by the Service Connect proxy, and according to the docs the metric is:

    The latency of the application request processing. The time elapsed, in
    milliseconds, after the request reached the Service Connect proxy in the target
    task until a response from the target application is received back to the proxy.
    

    https://docs.aws.amazon.com/AmazonECS/latest/developerguide/available-metrics.html

    2. ECS Service Connect and Target Response Time

    As mentioned above, the docs say the metric is the time in milliseconds from when the request arrives at the Service Connect proxy until the proxy gets a response from the target container. I see no notes specific to this metric that explain my issue.

    3. Metric Interpretation and Calibration

    As mentioned in my response above, I have periodic synthetic workloads in place, and I also tried manually inducing the workload between services A and B, with exactly the same results.

    The RequestLatency and TargetProcessingTime metrics you mention are not anywhere in the AWS documentation. target_processing_time is a value emitted by a load balancer, and for these requests it is ~17 seconds, as also mentioned above.

The constant 30,000ms reported by the TargetResponseTime metric in AWS ECS Service Connect may be due to default timeout settings, network overhead, or Service Connect-specific behavior. This metric might not fully reflect the actual application processing time, which could explain the discrepancy with the 16-second response seen in logs.

It’s recommended to verify timeout settings, consider network factors, and possibly use alternative metrics for more accurate performance monitoring. If the issue persists, contacting AWS support may help clarify the behavior.

EXPERT
answered a year ago
