Why has ecs-cli logs started to hang unexpectedly?


I have been using a command line invocation like this

ecs-cli logs --task-id {task-id} --cluster foo --aws-profile foo

to retrieve logs from tasks that have previously been started by aws ecs run-task and waited for with aws ecs wait tasks-stopped. The relevant code has been running for about 2.5 years without significant issue and, in fact, still runs in one of our AWS accounts. In another account, however, the ecs-cli logs command just hangs forever for no apparent reason. It used to run just fine in this account too, as recently as 2 weeks ago.
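For anyone trying to reproduce this, the sequence described above can be sketched roughly as follows. This is a minimal sketch, not the actual script: the function name run_and_fetch_logs and its three arguments are made up for illustration, and a real run-task call will usually need more options (for example a launch type and network configuration) than shown here.

```shell
# Hypothetical helper: start a task, wait for it to stop, then fetch its logs.
# Arguments: <cluster> <aws-profile> <task-definition>
run_and_fetch_logs() {
  local cluster="$1" profile="$2" task_def="$3"
  local task_arn task_id

  # Start the task and capture its ARN.
  task_arn=$(aws ecs run-task --cluster "$cluster" --task-definition "$task_def" \
    --profile "$profile" --query 'tasks[0].taskArn' --output text) || return 1

  # Block until the task has reached the STOPPED state.
  aws ecs wait tasks-stopped --cluster "$cluster" --tasks "$task_arn" \
    --profile "$profile" || return 1

  # The task ID is the last path segment of the task ARN.
  task_id="${task_arn##*/}"

  # This is the call that hangs in the affected account.
  ecs-cli logs --task-id "$task_id" --cluster "$cluster" --aws-profile "$profile"
}

# Usage (placeholders): run_and_fetch_logs foo foo my-task-definition
```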

I can't find any reason why this would be happening. The credentials are good. The task executes correctly in the failing environment and has stopped, but the ecs-cli logs command hangs. I am not specifying the --follow option on the logs command. The hang happens when ecs-cli is run on both our build machine and our development laptops. The same machines can run ecs-cli logs on tasks in a different account/VPC/cluster without error.
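One way to keep a hang like this from blocking a build is to run the command under a hard time limit. The following is a sketch that assumes GNU coreutils timeout is installed (it is not available by default on macOS); run_with_limit is a made-up helper name.

```shell
# Hypothetical helper: run a command with a time limit in seconds.
# GNU timeout exits with status 124 when the limit is reached.
run_with_limit() {
  local limit="$1"
  shift
  local rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "timed out after ${limit}s: $*" >&2
  fi
  return "$rc"
}

# Usage (placeholders):
#   run_with_limit 60 ecs-cli logs --task-id "$TASK_ID" --cluster foo --aws-profile foo
```

An exit status of 124 then separates "the command hung" from "the command returned an error", which is the distinction that matters here.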

I am running ecs-cli v1.21.0 everywhere.

Any suggestions about how I can debug this further?

Update: since first posting, I have learnt some more about the situation:

  • the problem briefly (for a couple of days) stopped occurring. At the time, I thought it was caused by a misaligned configuration in an apparently unrelated component (the CloudWatch-related configuration of API Gateway), and when I fixed that misalignment the problem stopped happening
  • however, the problem has since returned for reasons unknown, and the configuration difference does not explain it: all the configuration now matches what it should be and how it is configured in other environments
  • the user concerned has the standard Administrator role, so it isn't a permissions problem. A similar user is configured in the same way in other environments, and the problem does not occur there
  • it IS NOT a connectivity problem to the AWS services since I can execute the same scenarios in different accounts in the same region indicating that connectivity to the AWS endpoints is working (also see next point below - the request is getting to the API)
  • I get an error response from ecs-cli logs if I deliberately specify an incorrect task-id. If I specify a correct task-id, the command hangs for at least 10 minutes without generating any output
  • Can you specify the number of log events in the CloudWatch log group? Maybe it is taking time to pull all the logs, since you mentioned the application has been running for 2.5 years!
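To check whether CloudWatch Logs itself responds quickly for the task in question, the log stream can be read directly with the AWS CLI, bypassing ecs-cli. This is a sketch under assumptions: the awslogs driver names streams prefix/container-name/task-id, the helper names are made up, and the log group, prefix and container name must be taken from your own task definition.

```shell
# Hypothetical helper: build the awslogs stream name for a task.
# Arguments: <stream-prefix> <container-name> <task-id>
task_log_stream() {
  printf '%s/%s/%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: read the first few events of one stream directly.
# Arguments: <log-group> <log-stream> <aws-profile>
peek_task_logs() {
  local group="$1" stream="$2" profile="$3"
  aws logs get-log-events \
    --log-group-name "$group" \
    --log-stream-name "$stream" \
    --limit 10 --start-from-head \
    --profile "$profile" \
    --query 'events[].message' --output text
}

# Usage (placeholders):
#   peek_task_logs /ecs/my-app "$(task_log_stream my-prefix my-container "$TASK_ID")" foo
```

If this returns promptly while ecs-cli logs still hangs, the size of the log group is less likely to be the cause.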

asked 2 years ago, 761 views
2 Answers

This might be caused by network configuration (NACLs or security groups). The AWS CLI connects to AWS services over HTTPS endpoints by default, so if your instance doesn't allow outbound HTTPS (port 443) traffic, you would observe this behavior.

AWS
Petro_K
answered 2 years ago
  • Thanks - I have updated the post above to show why I think it cannot be an SSL connectivity issue. The same client is able to execute the problematic scenario against other accounts in the same region. Also, I get an error response from the ecs-cli logs command if I mis-specify the task-id, so the request is clearly making it to the API.


Please make sure that your ECS task is using the awslogs driver and has a log stream prefix specified.
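The log configuration can be read straight from the task definition. A minimal sketch, with show_log_config as a made-up helper name and the task definition and profile as placeholders:

```shell
# Hypothetical helper: print the logConfiguration of every container
# in a task definition. Arguments: <task-definition> <aws-profile>
show_log_config() {
  aws ecs describe-task-definition \
    --task-definition "$1" \
    --profile "$2" \
    --query 'taskDefinition.containerDefinitions[].logConfiguration' \
    --output json
}

# Usage (placeholders): show_log_config my-task-definition foo
```

In the output, logDriver should be awslogs, and the options should include awslogs-group, awslogs-region and awslogs-stream-prefix.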

Also, check if the IAM user/role being used to run the ecs-cli has the following IAM actions allowed.

  • logs:FilterLogEvents
  • ecs:DescribeTasks
  • ecs:DescribeTaskDefinition
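These three actions can be checked with the IAM policy simulator from the command line. This is a sketch only: check_log_permissions is a made-up name, the principal ARN is a placeholder, and the simulator evaluates the principal's identity-based policies, so it may not reflect every account-level restriction.

```shell
# Hypothetical helper: simulate the three actions for one IAM user or role.
# Arguments: <principal-arn> <aws-profile>
check_log_permissions() {
  local principal_arn="$1" profile="$2"
  aws iam simulate-principal-policy \
    --policy-source-arn "$principal_arn" \
    --action-names logs:FilterLogEvents ecs:DescribeTasks ecs:DescribeTaskDefinition \
    --profile "$profile" \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}

# Usage (placeholder ARN):
#   check_log_permissions arn:aws:iam::123456789012:user/example-user foo
```

Each action should be reported with the decision allowed.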

It is also worth checking if your outbound network connectivity to ecs.<region>.amazonaws.com and logs.<region>.amazonaws.com is intact.
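A quick way to probe both endpoints from the affected client is sketched below, assuming curl is installed. The helper names are made up and the region is a placeholder.

```shell
# Hypothetical helper: build a regional endpoint URL.
# Arguments: <service> <region>
service_endpoint() {
  printf 'https://%s.%s.amazonaws.com\n' "$1" "$2"
}

# Hypothetical helper: print the HTTP status and total time for one URL.
# curl reports status 000 when no HTTP response was received at all.
probe_endpoint() {
  curl -sS -o /dev/null --max-time 10 \
    -w '%{http_code} %{time_total}s\n' "$1"
}

# Usage (placeholder region):
#   probe_endpoint "$(service_endpoint ecs us-east-1)"
#   probe_endpoint "$(service_endpoint logs us-east-1)"
```

Any HTTP status, even an error status, shows that the TLS connection to the endpoint was established.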

As your issue seems to be account/cluster specific, it would be better to reach out to AWS Support.

If you have the above mentioned pre-requisites already in place, please open a support case with AWS using the link: https://console.aws.amazon.com/support/home?#/case/create

AWS
SUPPORT ENGINEER
answered 2 years ago
  • Thanks for your reply.

    Yes, the ECS task has awslogs configured. I know this because it used to work, it briefly worked again for a couple of days after it first stopped working, and it still works in other accounts where the relevant configuration is the same (we use Terraform to manage our environments).

    The user does have all these permissions - the user actually has the Administrator role. Again, this user is configured in exactly the same manner as corresponding users in other AWS accounts where the problem does not occur.

    SSL connectivity is verified - the same scenario executes without error from the same client talking to the same AWS API endpoints in the same AWS region. Also, I do get an error response if the --task-id is mis-specified. The command hangs only if the --task-id is valid (for at least 10 minutes, probably indefinitely).

    I have reached out to AWS Support, but I figured I would post here just in case others run across the same issue.
