This is a copy of a question I asked earlier on Stack Overflow. Hoping maybe I can get some useful responses here. Edits: formatting.
I have a Node.js application running in Docker, deployed to an Elastic Beanstalk cluster via ECS. The application has two environments, call them "stage" and "prod". Both environments are configured to stream (non-custom) instance logs to CloudWatch with identical security policies in place. Log streaming works correctly in one environment ("stage"), while the other ("prod") does not stream to CloudWatch (groups and streams are created but no events are ever written) and logs instead get written to disk on each EC2 instance.
I have verified the following are true for both environments:

- Both environments are in the same region (`us-east-1`).
- Identical platform and version (Docker on Amazon Linux 2/3.0.0).
- The *Instance log streaming to CloudWatch Logs* option enabled in the *Software* section of the configuration tab on the EB web console.
- Identical settings for *Retention* (3 days) and *Lifecycle* (Delete logs upon termination).
- Code deployed (a public-facing GraphQL API, if that matters) which writes a lot of logging output to the console via `console.debug`, `console.info`, and friends.
- Custom *Service Role* set in the *Security* section of the EB console's configuration tab. Both service roles resolve to the IAM role set as the instance profile.
- Custom *IAM Instance Profile* IAM roles with identical permissions, trust relationships, and permission policies, as below.

Trusted entities:

- `ec2.amazonaws.com`
- `elasticbeanstalk.amazonaws.com`, with condition: `StringEquals sts:ExternalId elasticbeanstalk`

Permissions policies:

- AmazonEC2ContainerRegistryReadOnly
- AWSElasticBeanstalkEnhancedHealth
- AWSElasticBeanstalkWebTier
- AWSElasticBeanstalkMulticontainerDocker
- AmazonEC2ContainerRegistryPowerUser
- AWSElasticBeanstalkWorkerTier
- sns-topic-publish-allow-policy
- cloudwatch-allow-policy
- AWSElasticBeanstalkManagedUpdatesCustomerRolePolicy
The `cloudwatch-allow-policy` policy document:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:DescribeLogGroups",
        "logs:CreateLogStream"
      ],
      "Resource": "*"
    }
  ]
}
```
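For what it's worth, I have been sanity-checking the instance-profile permissions from inside an affected instance with the AWS CLI. The log group name below is the EB-created one from above, and `manual-test` is a throwaway stream name I made up for this check:

```shell
# Run from an SSH session on a prod instance. GROUP is the EB-created
# log group; "manual-test" is a throwaway stream name for this check.
GROUP="/aws/elasticbeanstalk/prod/var/log/eb-docker/containers/eb-current-app/stdouterr.log"

aws sts get-caller-identity --region us-east-1   # confirm the instance role is what I expect
aws logs describe-log-streams --log-group-name "$GROUP" --region us-east-1

# The instance policy allows CreateLogStream and PutLogEvents, so both
# of these should succeed if permissions are not the problem:
aws logs create-log-stream --log-group-name "$GROUP" \
  --log-stream-name "manual-test" --region us-east-1
aws logs put-log-events --log-group-name "$GROUP" \
  --log-stream-name "manual-test" \
  --log-events "timestamp=$(date +%s%3N),message=manual smoke test" \
  --region us-east-1
```

If the `put-log-events` call succeeds by hand, that would at least rule out IAM and point at the agent itself.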
Both environments otherwise run correctly, sit at Green/OK health status, and report no permission problems. The only differences are that stage is not load balanced or scaled and runs on a smaller instance size, while prod has load balancing and auto scaling (which I'm assuming is irrelevant, but I can share details on that if it matters).
**Expected behavior - stage**

When the application deployed to the `stage` environment writes something to the console, it appears as an event in a CloudWatch stream named `/aws/elasticbeanstalk/stage/var/log/eb-docker/containers/eb-current-app/stdouterr.log > %EC2-INSTANCE-ID%`, as I expect it to. If I ssh into the instance that wrote to the log, there is nothing written on disk under `/var/log/eb-docker/containers/eb-current-app`, which is also expected.
**Observed behavior - prod**

When the application deployed to the `prod` environment writes something to the console, on the other hand, nothing is written to CloudWatch. CloudWatch log groups appear, named `/aws/elasticbeanstalk/prod/var/log/eb-docker/containers/eb-current-app/stdouterr.log > %EC2-INSTANCE-ID%`, but no events are ever logged. If I ssh into the instance that wrote to the log, the logged text appears on disk under `/var/log/eb-docker/containers/eb-current-app/eb-%SOME_HASH%-stdouterr.log`, and if *Instance log streaming to CloudWatch Logs* is left enabled, all the instances eventually fill up their available disk space with log contents and crash.
This condition has survived multiple instance restarts, waits of multiple hours with the streaming option enabled, the termination and rebuild of every instance in the environment, and deployment of new application versions from ECS.
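In case it's relevant: my understanding is that on the Amazon Linux 2 platform the instance log streaming is handled by the awslogs agent, so these are the checks I have been running on the broken prod instances (service and file names assume that platform, so verify them on your own instance):

```shell
# On an affected instance: check the CloudWatch Logs (awslogs) agent.
sudo systemctl status awslogsd          # is the agent actually running?
sudo tail -n 50 /var/log/awslogs.log    # look for auth errors, throttling, parse failures
ls /etc/awslogs/config/                 # the EB-generated per-log-file stream configs
```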
If I clone `stage` to a new environment, log streaming works as expected. If I clone `prod` to a new environment, log streaming fails in exactly the same manner as in the original environment. Something is clearly misconfigured for `prod`, but I don't have a clue what it is. What am I missing?
**Update** - We terminated and rebuilt "prod" via a Terraform script, and log streaming is working there now. No clue what the original problem was.
HOWEVER, what happens now is that the individual instances slowly fill up their disks, because the `/var/log/eb-docker/containers/eb-current-app/eb-###-stdouterr.log` file never seems to be deleted or truncated. My understanding is that these instances are supposed to be configured with log rotation by default to handle this, but either it isn't working correctly, or it doesn't run often enough to handle our logging load (which I don't imagine is too aggressive: a few lines of JSON per request coming in over the GraphQL API).

What am I missing here?
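For reference, on the Amazon Linux 2 platform the hourly rotation configs appear to live under `/etc/logrotate.elasticbeanstalk.hourly/` (invoked from `/etc/cron.hourly/`). A size-capped stanza for the container logs might look something like the sketch below - this is illustrative, not EB's actual shipped config, so check the real file on an instance before relying on it:

```
/var/log/eb-docker/containers/eb-current-app/*stdouterr.log {
    size 10M
    rotate 5
    copytruncate
    missingok
    notifempty
}
```

`copytruncate` matters here because the Docker container keeps the file handle open; rotating by rename without truncation would leave the container writing to the rotated file.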