ECS Agent not reporting metrics


We're not seeing ECS Metrics from our EC2 based ECS clusters in the ca-central-1 region.

The instances that back this cluster have the AmazonEC2ContainerServiceforEC2Role managed policy on them which includes the ecs:StartTelemetrySession action for resource *.
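One way to double-check that claim is to pull the policy document (for example from the IAM console) and test it for the action. The following is only a minimal sketch: the `policy_allows` helper and the trimmed-down `example_policy` are hypothetical and written for illustration, not the real contents of the managed policy, and the check ignores Deny statements, conditions, and resource scoping.

```python
import fnmatch


def policy_allows(policy_document, action):
    """Return True if any Allow statement has an Action pattern matching `action`.

    Simplified on purpose: Deny statements, NotAction, Condition and Resource
    are not evaluated.
    """
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # IAM action names are case-insensitive and may contain * wildcards.
        if any(fnmatch.fnmatchcase(action.lower(), p.lower()) for p in actions):
            return True
    return False


# Hypothetical, trimmed-down policy document used only to exercise the helper.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ecs:Poll", "ecs:StartTelemetrySession"],
            "Resource": "*",
        }
    ],
}

print(policy_allows(example_policy, "ecs:StartTelemetrySession"))
```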

Looking at the cluster itself, I see CloudWatch metrics is marked as Default.

We aren't using Container Metrics, but I would still expect at least the basic ECS CloudWatch metrics without that.

Looking at the /var/log/ecs/ecs-agent.log file on one of the cluster instances, I see these logs (sanitized to remove the account ID and service-specific info):

level=info time=2023-10-03T22:54:24Z msg="Using cached DiscoverPollEndpoint" endpoint="https://ecs-a-2.ca-central-1.amazonaws.com/" telemetryEndpoint="https://ecs-t-2.ca-central-1.amazonaws.com/" serviceConnectEndpoint="https://ecs-sc.ca-central-1.api.aws" containerInstanceARN="arn:aws:ecs:ca-central-1:************:container-instance/****/****"
level=info time=2023-10-03T22:54:24Z msg="Establishing a Websocket connection" url="https://ecs-t-2.ca-central-1.amazonaws.com/ws?agentHash=********&agentVersion=1.70.0&cluster=arn%3Aaws%3Aecs%3Aca-central-1%3A************%3Acluster%2F****&containerInstance=arn%3Aaws%3Aecs%3Aca-central-1%3A************%3Acontainer-instance%2F****%2F****&dockerVersion=20.10.17"
level=debug time=2023-10-03T22:54:24Z msg="Established a Websocket connection to https://ecs-t-2.ca-central-1.amazonaws.com/ws?agentHash=********&agentVersion=1.70.0&cluster=arn%3Aaws%3Aecs%3Aca-central-1%3A************%3Acluster%2F****&containerInstance=arn%3Aaws%3Aecs%3Aca-central-1%3A************%3Acontainer-instance%2F****%2F****&dockerVersion=20.10.17" module=client.go
level=info time=2023-10-03T22:54:24Z msg="Connected to TCS endpoint" module=handler.go
level=debug time=2023-10-03T22:54:24Z msg="TCS client starting websocket poll loop" module=client.go
level=debug time=2023-10-03T22:54:24Z msg="Received message of type: AckPublishHealth" module=client.go
level=debug time=2023-10-03T22:54:24Z msg="Received ACKPublishHealth from tcs" module=handler.go
level=debug time=2023-10-03T22:54:24Z msg="Received message of type: AckPublishMetric" module=client.go
level=debug time=2023-10-03T22:54:24Z msg="Received AckPublishMetric from tcs" module=handler.go
level=debug time=2023-10-03T22:54:24Z msg="Error getting message from ws backend: error: [websocket: close 1008 (policy violation): InvalidContent : Unexpected text frame received], messageType: [-1] " module=client.go
level=debug time=2023-10-03T22:54:24Z msg="Unsubscribing event handler TCSDeregisterContainerInstanceHandler from event stream DeregisterContainerInstance" module=eventstream.go
level=error time=2023-10-03T22:54:24Z msg="Error: lost websocket connection with ECS Telemetry service (TCS): websocket: close 1008 (policy violation): InvalidContent : Unexpected text frame received" module=handler.go
level=debug time=2023-10-03T22:54:25Z msg="Storage stats not reported for container" module=utils_unix.go
level=debug time=2023-10-03T22:54:26Z msg="Handling http requestmethodGETfrom172.17.0.2:43620" module=logging_handler.go
level=debug time=2023-10-03T22:54:28Z msg="Handling http requestmethodGETfrom172.17.0.2:42818" module=logging_handler.go
level=debug time=2023-10-03T22:54:30Z msg="Received message of type: HeartbeatMessage" module=client.go
level=debug time=2023-10-03T22:54:30Z msg="ACS activity occurred" module=acs_handler.go

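So the agent does connect to the telemetry (TCS) endpoint, and even receives AckPublishHealth and AckPublishMetric, but the connection is then closed with websocket code 1008 (policy violation) and the reason "InvalidContent : Unexpected text frame received". As mentioned above, this isn't caused by a missing ecs:StartTelemetrySession permission, as that's included in the AmazonEC2ContainerServiceforEC2Role.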
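To see how often this happens across a whole log file, the agent log lines can be parsed as key=value pairs and filtered for websocket close events. This is a rough sketch, assuming every line follows the same `level=... time=... msg="..." module=...` layout as the excerpt above; `parse_line` and `find_close_errors` are names made up for this example, and the two sample lines are copied from the log in this question.

```python
import re

# Matches key=value and key="quoted value" pairs as they appear in the agent log.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
# Matches e.g. "websocket: close 1008 (policy violation)".
CLOSE = re.compile(r"websocket: close (\d+) \(([^)]*)\)")


def parse_line(line):
    """Split one agent log line into a dict of its fields."""
    fields = {}
    for key, value in PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # strip the surrounding quotes
        fields[key] = value
    return fields


def find_close_errors(lines):
    """Yield time, level, close code and reason for each websocket close."""
    for line in lines:
        fields = parse_line(line)
        match = CLOSE.search(fields.get("msg", ""))
        if match:
            yield {
                "time": fields.get("time"),
                "level": fields.get("level"),
                "code": int(match.group(1)),
                "reason": match.group(2),
            }


SAMPLE_LINES = [
    'level=info time=2023-10-03T22:54:24Z msg="Connected to TCS endpoint" module=handler.go',
    'level=error time=2023-10-03T22:54:24Z msg="Error: lost websocket connection with ECS Telemetry service (TCS): websocket: close 1008 (policy violation): InvalidContent : Unexpected text frame received" module=handler.go',
]

for event in find_close_errors(SAMPLE_LINES):
    print(event)
```

To run it against a real instance, the sample list could be replaced with the lines read from /var/log/ecs/ecs-agent.log.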

I did come across a GitHub issue for the ECS Agent that also had the policy violation error. In their specific case, they also reached out to Amazon Support and were told:

This has currently been identified as a service level issue where the connection to TCS endpoint fails while running new tasks using the new arn format. Due to this connection failure, the cluster's CPU and memory metrics are not populating successfully. The ECS service team is aware of this shortcoming and is actively working towards resolution.

I wanted to be double check on this issue and tried running a task with the new arn and I see the same error on my end as well. While the service team is working for a fix, the only workaround at this point in time is to opt-out from the New ARN/ID format but it will require to recreate the instances/tasks to take effect.

But that was back in November of 2018, and the new ARN format was supposed to become the default in April 2021.
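For what it's worth, the ARN format in use can be read off the websocket URL the agent logs. The sketch below assumes the long (new) container instance ARN has the form `container-instance/<cluster-name>/<id>` and the short (old) one `container-instance/<id>`; `container_instance_arn_format` is a hypothetical helper, and the URL is the sanitized one from the log above.

```python
from urllib.parse import parse_qs, urlsplit


def container_instance_arn_format(ws_url):
    """Return "new" or "old" based on the containerInstance ARN in the URL."""
    query = parse_qs(urlsplit(ws_url).query)  # parse_qs also percent-decodes
    arn = query["containerInstance"][0]
    # arn:partition:service:region:account:resource -> keep the resource part
    resource = arn.split(":", 5)[5]
    parts = resource.split("/")
    # Assumption: three segments means the cluster name is embedded (new format).
    return "new" if len(parts) == 3 else "old"


SANITIZED_URL = (
    "https://ecs-t-2.ca-central-1.amazonaws.com/ws?agentHash=********"
    "&agentVersion=1.70.0"
    "&cluster=arn%3Aaws%3Aecs%3Aca-central-1%3A************%3Acluster%2F****"
    "&containerInstance=arn%3Aaws%3Aecs%3Aca-central-1%3A************"
    "%3Acontainer-instance%2F****%2F****&dockerVersion=20.10.17"
)

print(container_instance_arn_format(SANITIZED_URL))
```

If that assumption holds, the two path segments after `container-instance/` in the sanitized ARN would suggest this cluster is already on the new format.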

Environment Details

$ sudo docker version
Client:
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.18.6
 Git commit:        100c701
 Built:             Sat Dec  3 04:13:49 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.6
  Git commit:       a89b842
  Built:            Sat Dec  3 04:14:27 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.6
  GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc:
  Version:          1.1.7
  GitCommit:        f19387a6bec4944c770f7668ab51c4348d9c2f38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.7G     0  7.7G   0% /dev
tmpfs           7.7G     0  7.7G   0% /dev/shm
tmpfs           7.7G  808K  7.7G   1% /run
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/nvme0n1p1  128G   54G   75G  42% /
tmpfs           1.6G     0  1.6G   0% /run/user/1000
ecs-agent
  version=1.70.0 
  commit=28ac48dc
asked 7 months ago, 86 views