CloudFormation mismanages VPCGatewayAttachment; stack breaks


The setup
A pre-existing VPC and Internet Gateway (IG). The IG is attached to the VPC:

$ aws ec2 describe-internet-gateways 
{
    "InternetGateways": [
        {
            "Attachments": [
                {
                    "State": "available",
                    "VpcId": "vpc-7a678b1f"
                }
            ],
            "InternetGatewayId": "igw-43d8c621",
            "OwnerId": "xxxxxxxxxxxx",
            "Tags": []
        }
    ]
}

Note: there are other resources in the VPC but they are not pertinent to the problem.
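The whole problem hinges on the gateway already being attached, so it can be worth checking for an existing attachment before deploying the template. The following is a minimal Python sketch, not an AWS-provided tool. It assumes the JSON printed by `aws ec2 describe-internet-gateways` (shape as above) is passed in as a string, and the function name `is_attached` is hypothetical. The output above reports an attached gateway's state as "available"; also accepting "attached" is an assumption.

```python
import json


def is_attached(describe_output: str, igw_id: str, vpc_id: str) -> bool:
    """Return True if igw_id is already attached to vpc_id.

    describe_output is assumed to be the JSON text printed by
    `aws ec2 describe-internet-gateways`.
    """
    data = json.loads(describe_output)
    for gateway in data.get("InternetGateways", []):
        if gateway.get("InternetGatewayId") != igw_id:
            continue
        for attachment in gateway.get("Attachments", []):
            # The output in the question shows "available" for an attached
            # gateway; also accepting "attached" is an assumption.
            if attachment.get("VpcId") == vpc_id and attachment.get("State") in ("available", "attached"):
                return True
    return False
```

With the output shown above, `is_attached(output, "igw-43d8c621", "vpc-7a678b1f")` returns `True`. That result would have been a signal not to declare the VPCGatewayAttachment in the template.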

A CloudFormation nested template that (amongst other things) creates a VPCGatewayAttachment using the ID (igw-43d8c621) of the Internet Gateway. The parent stack is configured to roll back upon failure, and it did so because of a mistake in a sibling nested template (operator fat fingers).

  InternetGatewayAttachment:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      InternetGatewayId: 'igw-43d8c621'
      VpcId: 'vpc-7a678b1f'

Expected behaviour
CloudFormation should fail to create the VPCGatewayAttachment because the Internet Gateway it refers to is already attached to the VPC; there is already an attachment present.

It should have commenced rollback at this point.

It should not report having successfully created the attachment.

Actual behaviour
CloudFormation successfully creates the VPCGatewayAttachment and reports this. It then goes on to create the remaining resources in the template, but ends up failing due to a mistake in a sibling template.

A rollback is attempted and the previously created resources are deleted (as expected); however, it fails to delete the VPCGatewayAttachment it reported having created:

DELETE_FAILED	Network vpc-7a678b1f has some mapped public address(es). Please unmap those public address(es) before detaching the gateway. (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: 0a790e5d-41d0-4c2f-9580-84eb81058b2a)

This nested stack is permanently in a DELETE_FAILED state.

The parent stack is permanently in a ROLLBACK_FAILED state.

What is going on?
CloudFormation is such a black box that my reasoning powers are limited. It reports a "Physical ID", searc-Inter-19YDP23IEGX15, but I have no idea what to do with it. None of the tools at my disposal accept this reference.

My hypothesis is that CloudFormation attempted to create a VPCGatewayAttachment as instructed and internally recorded having done so, even though no new attachment is present. The absence of a new attachment is confirmed by viewing the Internet Gateway's attachments (see the first code block). The pre-existing gateway attachment is now in CloudFormation's ledger of resources under its ownership and control; it mistakenly thinks that it created the VPCGatewayAttachment! Upon rollback it attempts to delete the attachment but cannot, because the VPC that this gateway attachment refers to has public IP addresses mapped in it.

Outcomes
I'm stalemated. I can neither update the stacks with the correct configuration nor delete them and start again. All progress on these stacks is halted.

My only recourse would be to delete the gateway attachment manually; however, this option is not available to me, as the production workload that depends on this attachment would be disrupted.

An alternative would be to spin up a completely new stack without the attachment, but I find this solution too inelegant, especially for production.

I'm stuck. Please help.

asked 4 years ago
1 Answer

I received a very helpful and enlightening reply from AWS tech support. The answer has been edited for brevity.

tl;dr:
- CloudTrail is a useful diagnostic tool for looking under the hood of CloudFormation
- CloudFormation doesn't always get it right! It assumed control of a resource it did not create
- the erroneous stack could be deleted without removing the existing resource
- AWS tech support is quite good


Embedded stack was not successfully deleted: The following resource(s) failed to delete: InternetGatewayAttachment.

One of the child stacks failed with the error “TemplateURL must be an Amazon S3 URL.”, which caused the parent stack to roll back. The deletion of the child stack then failed with the following error:

Network vpc-xxxxxxxx has some mapped public address(es). Please unmap those public address(es) before detaching the gateway. (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: 0a790e5d-41d0-4c2f-9580-84eb81058b2a)

On reviewing the stack, I could see that you are attaching the Internet Gateway to the VPC (vpc-xxxxxxxx), but it had already been attached manually.

From CloudTrail, I could see that when CloudFormation tried to attach it, the call returned the following error at 2020-06-03T07:03:03 UTC:

  "eventTime": "2020-06-03T07:03:03Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "AttachInternetGateway",
  "awsRegion": "ap-southeast-2",
  "sourceIPAddress": "cloudformation.amazonaws.com",
  "userAgent": "cloudformation.amazonaws.com",
  "errorCode": "Client.Resource.AlreadyAssociated",
  "errorMessage": "resource igw-xxxxxxxx is already attached to network vpc-xxxxxxxx"

But the stack didn’t fail at that point and moved ahead with the creation of the other resources.
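The CloudTrail check that support describes can be sketched in Python. This is an illustrative helper, not part of any AWS tool. It assumes the input has the shape returned by CloudTrail's LookupEvents API (for example boto3's `cloudtrail.lookup_events`), where each entry in `Events` carries the raw record as a JSON string under `CloudTrailEvent`. The name `swallowed_attach_errors` is hypothetical.

```python
import json
from typing import Any, Dict, List

ALREADY_ASSOCIATED = "Client.Resource.AlreadyAssociated"


def swallowed_attach_errors(lookup_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return AttachInternetGateway records, issued by CloudFormation, that
    failed with Client.Resource.AlreadyAssociated.

    lookup_response is assumed to have the shape of a CloudTrail LookupEvents
    response: each item in "Events" holds the raw record as a JSON string
    under "CloudTrailEvent". Only one page is inspected; further pages would
    have to be fetched via NextToken.
    """
    matches = []
    for event in lookup_response.get("Events", []):
        record = json.loads(event["CloudTrailEvent"])
        if (
            record.get("eventName") == "AttachInternetGateway"
            and record.get("userAgent") == "cloudformation.amazonaws.com"
            and record.get("errorCode") == ALREADY_ASSOCIATED
        ):
            matches.append(record)
    return matches
```

In practice the response might come from `boto3.client("cloudtrail").lookup_events(LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "AttachInternetGateway"}])`. A non-empty result means CloudFormation hit "Client.Resource.AlreadyAssociated" and carried on anyway.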

To understand the logic behind this behavior, I examined the internal CloudFormation service code and found that it works as follows:
- It tries to attach the Internet Gateway to the VPC.
- If the call fails with any error other than "Client.Resource.AlreadyAssociated", it throws an error.
- If it encounters the "Client.Resource.AlreadyAssociated" error, it ignores the error and proceeds to create a physical ID for the association (like prod-Attac-1F371BGBSXSSL).

This logic seems to hold true for Internet Gateway and VPN Gateway resources, according to the part of the CloudFormation code I reviewed.
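The behaviour described above can be modelled in a few lines of Python. This is a hypothetical model written from support's description, not CloudFormation's actual code. `AttachError`, `create_gateway_attachment` and the `attach` callable are invented for illustration, and the physical ID format only loosely imitates the examples quoted in this thread.

```python
import secrets
import string
from typing import Callable

ALREADY_ASSOCIATED = "Client.Resource.AlreadyAssociated"


class AttachError(Exception):
    """Hypothetical stand-in for an EC2 API error that carries an error code."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(message or code)
        self.code = code


def create_gateway_attachment(
    attach: Callable[[str, str], None],
    igw_id: str,
    vpc_id: str,
    stack_name: str,
    logical_id: str,
) -> str:
    """Model of the create step as support described it (not real CFN code).

    Returning a physical ID means the resource is reported as created.
    """
    try:
        attach(igw_id, vpc_id)
    except AttachError as err:
        # Any error other than AlreadyAssociated fails the resource.
        if err.code != ALREADY_ASSOCIATED:
            raise
        # AlreadyAssociated is swallowed: no new attachment was made, but
        # execution falls through and the resource is still reported as
        # created, so the pre-existing attachment becomes "owned" by the stack.

    # Illustrative physical ID loosely imitating "prod-Attac-1F371BGBSXSSL".
    alphabet = string.ascii_uppercase + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(13))
    return f"{stack_name[:5]}-{logical_id[:5]}-{suffix}"
```

Suppose `attach` raises `AttachError("Client.Resource.AlreadyAssociated")`. The function still returns a physical ID beginning with `prod-Attac-` for a stack named `prod`, which is how a pre-existing attachment ends up recorded as a stack-owned resource. Any other error code propagates and fails the resource.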

That said, I have not seen any other CloudFormation resources with similar logic where such errors are ignored; therefore, I would suggest recreating the stack as a best practice in this case.

Now, regarding the error:

"Network vpc-xxxxxxxx has some mapped public address(es). Please unmap those public address(es) before detaching the gateway. (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: 0a790e5d-41d0-4c2f-9580-84eb81058b2a)"

The CloudFormation template declares the "VPCGatewayAttachment" resource, so on deletion CloudFormation tries to detach the gateway. However, because resources outside the CloudFormation template depend on the attachment, it was unable to detach it and failed with the error above.
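To see which addresses are blocking the detach, one option is to list the public IPs associated with network interfaces in the VPC. The Python sketch below assumes that the "mapped public address(es)" in the error are public IPs associated with ENIs (Elastic IPs or auto-assigned public IPs). It also assumes its input has the shape of an EC2 DescribeNetworkInterfaces response already filtered by `vpc-id`. The helper name `mapped_public_addresses` is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def mapped_public_addresses(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """List (network interface ID, public IP) pairs found in a
    DescribeNetworkInterfaces response (assumed already filtered by vpc-id)."""
    pairs = []
    for eni in response.get("NetworkInterfaces", []):
        eni_id = eni.get("NetworkInterfaceId", "")
        seen = set()
        # Each private IP on the ENI may carry its own public association.
        for private in eni.get("PrivateIpAddresses", []):
            public_ip = private.get("Association", {}).get("PublicIp")
            if public_ip and public_ip not in seen:
                seen.add(public_ip)
                pairs.append((eni_id, public_ip))
        # Fall back to the ENI-level association (primary private IP).
        top = eni.get("Association", {}).get("PublicIp")
        if top and top not in seen:
            pairs.append((eni_id, top))
    return pairs
```

The response could come from `boto3.client("ec2").describe_network_interfaces(Filters=[{"Name": "vpc-id", "Values": ["vpc-xxxxxxxx"]}])`. In this scenario those addresses belong to the production workload, so the point is to understand the DependencyViolation, not to unmap them.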

To delete the stack while retaining the InternetGatewayAttachment resource (CloudFormation will skip that resource and delete the rest), complete the following steps:

  1. Open the AWS CloudFormation console.
  2. Choose the stack that's stuck in the DELETE_FAILED status.
  3. Choose Delete. A pop-up window opens and lists the resources that failed to delete.
  4. In the pop-up window, select all the resources that you want to retain (in your case, the InternetGatewayAttachment), and then choose Delete stack.

AWS CloudFormation tries to delete the stack again, but doesn't delete any of the resources that you selected to retain (in your case, the InternetGatewayAttachment). The status of your stack should change to DELETE_COMPLETE.
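The same retain-on-delete step can be scripted. This is a minimal Python sketch, assuming `cf_client` behaves like boto3's CloudFormation client (`describe_stacks`, `describe_stack_resources` and `delete_stack` with `RetainResources`). The function name `delete_retaining_failed` is hypothetical. `RetainResources` is only accepted for stacks in the DELETE_FAILED state, so the sketch checks that first.

```python
from typing import Any, List


def delete_retaining_failed(cf_client: Any, stack_name: str) -> List[str]:
    """Retry deleting a DELETE_FAILED stack while retaining the resources
    that failed to delete. Returns the retained logical IDs."""
    status = cf_client.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
    if status != "DELETE_FAILED":
        # RetainResources is only valid for stacks in DELETE_FAILED.
        raise ValueError(f"{stack_name} is {status}, expected DELETE_FAILED")

    # describe_stack_resources returns at most 100 resources; a larger stack
    # would need the paginated list_stack_resources call instead.
    resources = cf_client.describe_stack_resources(StackName=stack_name)["StackResources"]
    failed = [
        r["LogicalResourceId"]
        for r in resources
        if r["ResourceStatus"] == "DELETE_FAILED"
    ]

    # CloudFormation deletes the stack but leaves the retained resources in place.
    cf_client.delete_stack(StackName=stack_name, RetainResources=failed)
    return failed
```

For this thread, `stack_name` would be the nested stack stuck in DELETE_FAILED. The CLI equivalent is `aws cloudformation delete-stack --stack-name <nested-stack-name> --retain-resources InternetGatewayAttachment`.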

answered 4 years ago
