CloudFormation "Internal Failure" on drift detection with certain resources

1

One of our customers has an AWS Config rule that detects drift on our stacks. Since Friday, all monitored stacks have been throwing the "Internal Failure" error. We narrowed it down to certain resources that cause this error when drift detection is run on the complete stack. So far these are AWS::IAM::ManagedPolicy and AWS::Config::ConfigRule.

Here is a PoC to reproduce this:

AWSTemplateFormatVersion: "2010-09-09"
Description: PoC stack for Failed to detect drift on resources Internal Failure

Resources:
  S3Bucket:
    Type: "AWS::S3::Bucket"
    DeletionPolicy: Delete

  DenyAllPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Deny"
            Action:
              - "*"
            Resource: "*"

Once the stack is deployed, you can run these commands to see the behavior.

This stores the drift detection ID in the shell variable DRIFT_ID (make sure to replace <stack-name>):

$ DRIFT_ID=$(aws cloudformation detect-stack-drift --stack-name <stack-name> --query StackDriftDetectionId --output text)

That ID is then used in this command to get the actual results of the drift detection:

$ aws cloudformation describe-stack-drift-detection-status --stack-drift-detection-id "$DRIFT_ID"

{
  "StackId": "arn:aws:cloudformation:<region>:<account_id>:stack/drift-poc/<id>",
  "StackDriftDetectionId": "<detection_id>",
  "StackDriftStatus": "IN_SYNC",
  "DetectionStatus": "DETECTION_FAILED",
  "DetectionStatusReason": "{\"Summary\":\"Failed to detect drift on resources [S3Bucket]\",\"Failures\":[{\"Resource\":\"S3Bucket\",\"FailureReason\":\"Internal Failure\"}]}",
  "DriftedStackResourceCount": 0,
  "Timestamp": "2024-03-12T07:59:21.951000+00:00"
}
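
Since detect-stack-drift only starts the detection, the status command can still report DETECTION_IN_PROGRESS if it is run right away. A minimal sketch of a polling helper is shown here; the function name wait_for_drift_detection and the 5-second default interval are assumptions for illustration and are not part of the AWS CLI:

```shell
#!/usr/bin/env bash
# Sketch: poll the drift detection status until it is no longer in progress.
# wait_for_drift_detection is a hypothetical helper name, not an AWS CLI command.
wait_for_drift_detection() {
  local drift_id="$1"
  local interval="${2:-5}"   # seconds between polls (arbitrary default)
  local status
  while true; do
    # Ask only for the DetectionStatus field, as plain text.
    status=$(aws cloudformation describe-stack-drift-detection-status \
      --stack-drift-detection-id "$drift_id" \
      --query DetectionStatus --output text) || return 1
    if [ "$status" != "DETECTION_IN_PROGRESS" ]; then
      # Prints DETECTION_COMPLETE or DETECTION_FAILED.
      echo "$status"
      return 0
    fi
    sleep "$interval"
  done
}
```

It would be called as wait_for_drift_detection "$DRIFT_ID" before fetching the full status shown above.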


We also tried running drift detection on each resource individually, which worked fine.

$ aws cloudformation detect-stack-resource-drift --stack-name drift-poc --logical-resource-id S3Bucket
{
  "StackResourceDrift": {
    "StackId": "arn:aws:cloudformation:<region>:<account_id>:stack/drift-poc/<id>",
    "LogicalResourceId": "S3Bucket",
    "PhysicalResourceId": "drift-poc-s3bucket-<hash>",
    "ResourceType": "AWS::S3::Bucket",
    "ExpectedProperties": "{\"Tags\":[{\"Key\":\"project\",\"Value\":\"drift-poc\"}]}",
    "ActualProperties": "{\"Tags\":[{\"Key\":\"project\",\"Value\":\"drift-poc\"}]}",
    "PropertyDifferences": [],
    "StackResourceDriftStatus": "IN_SYNC",
    "Timestamp": "2024-03-12T08:52:41.603000+00:00"
  }
}
$ aws cloudformation detect-stack-resource-drift --stack-name drift-poc --logical-resource-id DenyAllPolicy
{
  "StackResourceDrift": {
    "StackId": "arn:aws:cloudformation:<region>:<account_id>:stack/drift-poc/<id>",
    "LogicalResourceId": "DenyAllPolicy",
    "PhysicalResourceId": "arn:aws:iam::<account_id>:policy/drift-poc-DenyAllPolicy-peOszzUwXpYh",
    "ResourceType": "AWS::IAM::ManagedPolicy",
    "ExpectedProperties": "{\"PolicyDocument\":{\"Version\":\"2012-10-17\",\"Statement\":[{\"Action\":[\"*\"],\"Resource\":\"*\",\"Effect\":\"Deny\"}]}}",
    "ActualProperties": "{\"PolicyDocument\":{\"Version\":\"2012-10-17\",\"Statement\":[{\"Action\":[\"*\"],\"Resource\":\"*\",\"Effect\":\"Deny\"}]}}",
    "PropertyDifferences": [],
    "StackResourceDriftStatus": "IN_SYNC",
    "Timestamp": "2024-03-12T08:55:08.668000+00:00"
  }
}
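
For stacks with more than a handful of resources, the per-resource check above can be scripted. This is only a sketch under assumptions: detect_drift_per_resource is a hypothetical helper name, and resource types that do not support drift detection are simply reported as ERROR:

```shell
#!/usr/bin/env bash
# Sketch: run detect-stack-resource-drift for every resource in a stack.
# detect_drift_per_resource is a hypothetical helper name.
detect_drift_per_resource() {
  local stack="$1"
  local ids id status
  # Collect all logical resource IDs of the stack as whitespace-separated text.
  ids=$(aws cloudformation list-stack-resources \
    --stack-name "$stack" \
    --query 'StackResourceSummaries[].LogicalResourceId' \
    --output text) || return 1
  for id in $ids; do
    # A failing call (for example an unsupported resource type) becomes ERROR.
    status=$(aws cloudformation detect-stack-resource-drift \
      --stack-name "$stack" \
      --logical-resource-id "$id" \
      --query 'StackResourceDrift.StackResourceDriftStatus' \
      --output text 2>/dev/null) || status="ERROR"
    printf '%s\t%s\n' "$id" "$status"
  done
}
```

Calling detect_drift_per_resource drift-poc prints one line per resource with its logical ID and drift status.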

As a workaround, we would need to deploy those resources in their own stack and exclude them from monitoring, which might be OK as a temporary solution. But this is not something we want to keep permanently on our client's infrastructure.


EDIT: I figured out that AWS::IAM::ManagedPolicy requires a Groups, Users, or Roles property. So instead of attaching the policy on the User resource, you need to add the entity to the policy. The problem with AWS::Config::ConfigRule still persists, though.

asked 2 months ago, 334 views
2 Answers
1
Accepted Answer

The problem has been solved. We did absolutely nothing on the original stack and everything is compliant again. I guess someone fixed something internally on AWS.

answered a month ago
-1

The Failed to detect drift on resources Internal Failure error occurs when CloudFormation is unable to detect drift for one or more resources during a stack drift detection operation. Some things you can check:

  • Make sure the IAM role used for drift detection has the necessary permissions to access and describe all resource properties. The role needs at least s3:GetObject permissions to detect drift on S3 buckets.
  • Wait some time after deployment and try detecting drift again. There may be eventual consistency issues if drift detection is triggered too soon after deploying changes.
  • Review the stack resources and events for any errors or failures during deployment that could impact drift detection.
  • Try deleting and recreating the problematic resources, then detect drift on the updated stack.
  • As a workaround, you can run the detect-stack-drift command with the --ignore-resource-types flag to skip drift detection on specific resource types experiencing issues temporarily.
  • To debug further, you can check the CloudFormation API call logs or CloudTrail events for errors during the drift detection operation. Contact AWS Support if the issue persists after retrying with the above suggestions.
EXPERT
answered 2 months ago
    • Make sure the IAM role used for drift detection has the necessary permissions to access and describe all resource properties. The role needs at least s3:GetObject permissions to detect drift on S3 buckets.

    This error occurs even when I run the command with my own user, which has AdminAccess.

    • Wait some time after deployment and try detecting drift again. There may be eventual consistency issues if drift detection is triggered too soon after deploying changes.

    Drift detection worked for a couple of weeks without any problem, but since Friday it has thrown this error. We didn't change anything on the stack. The PoC also shows that this can be reproduced on any account.

    • Review the stack resources and events for any errors or failures during deployment that could impact drift detection.

    As the PoC template shows, the stack deploys successfully and still throws this error.

    • Try deleting and recreating the problematic resources, then detect drift on the updated stack.

    Already tried, with the same result.

    • As a workaround, you can run the detect-stack-drift command with the --ignore-resource-types flag to skip drift detection on specific resource types experiencing issues temporarily.

    Is this also possible with a Config rule?

    • To debug further, you can check the CloudFormation API call logs or CloudTrail events for errors during the drift detection operation.

    Already did. Nothing useful was found, and no error codes were shown.

  • Also, the --ignore-resource-types flag does not seem to exist.
