Trouble with cfn-signal: CloudFormation Stack Not Progressing Despite Successful Signal Events


I'm facing an issue with my CloudFormation stack. I have an EC2 Auto Scaling group in a nested stack with a creation policy defined. Although successful signal events are being sent (they appear in CloudTrail with the correct stack name and the logical resource ID of the Auto Scaling group), the stack never reaches CREATE_COMPLETE and is ultimately rolled back because no success signals were received. Any ideas?

Initialization of the ASG:

const eni = new CfnNetworkInterface(...)
const instanceProfile = new InstanceProfile(...)

const asg = new CfnAutoScalingGroup(this, 'autoscaling-group', {
    availabilityZones: [props.availabilityZone],
    autoScalingGroupName: `${props.env}-${props.app}-${props.hostname}-autoscaling-group`,
    maxSize: '1',
    minSize: '0',
    desiredCapacity: '1',
    defaultInstanceWarmup: 60,
})

const launchTemplate = new CfnLaunchTemplate(this, 'launch-template', {
    launchTemplateData: {
        instanceType: props.instanceType.toString(),
        keyName: props.keyName,
        imageId: props.machineImage.getImage(this).imageId,
        iamInstanceProfile: {
            arn: instanceProfile.instanceProfileArn,
        },
        networkInterfaces: [{
            deviceIndex: 0,
            networkInterfaceId: eni.ref.toString(),
        }],
        userData: Fn.base64(instanceUserData({
                env: props.env,
                class: props.class,
                hostname: props.hostname,
                stackName: Stack.of(this).stackName,
                asgLogicalId: asg.logicalId,
            }),
        ),
    }
})

launchTemplate.addDependency(eni)
asg.addDependency(launchTemplate)

asg.addPropertyOverride('LaunchTemplate', {
    LaunchTemplateId: launchTemplate.ref,
    Version: launchTemplate.attrLatestVersionNumber.toString(),
})

asg.cfnOptions.creationPolicy = {
    autoScalingCreationPolicy: {
        minSuccessfulInstancesPercent: 100,
    },
    resourceSignal: {
        count: 1,
        timeout: 'PT5M',
    }
}

The user data script has the following call to cfn-signal:

/usr/bin/cfn-signal -e $? --stack $STACK_NAME --resource $RESOURCE_LOGICAL_ID --region $AWS_REGION
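
For readers who want to see how the three variables in that call could be populated, here is a minimal sketch. The `instanceUserData` helper from the question is not shown, so `renderSignalUserData`, its `SignalScriptProps` interface, and the `/bin/true` placeholder step are all hypothetical names introduced only for illustration. One point worth keeping in mind is that `$?` holds the exit status of the command that ran immediately before `cfn-signal`, so that is the status that gets reported.

```typescript
// Hypothetical stand-in for the question's instanceUserData() helper,
// which is not shown in the post.
interface SignalScriptProps {
  stackName: string;    // stack that owns the resource with the CreationPolicy
  asgLogicalId: string; // logical ID of the Auto Scaling group
  region: string;       // region of the CloudFormation endpoint to signal
}

function renderSignalUserData(props: SignalScriptProps): string {
  return [
    '#!/bin/bash',
    `STACK_NAME='${props.stackName}'`,
    `RESOURCE_LOGICAL_ID='${props.asgLogicalId}'`,
    `AWS_REGION='${props.region}'`,
    '# Placeholder for the real bootstrap steps (assumption).',
    '# Its exit status is what $? reports on the next line.',
    '/bin/true',
    '/usr/bin/cfn-signal -e $? --stack "$STACK_NAME" --resource "$RESOURCE_LOGICAL_ID" --region "$AWS_REGION"',
  ].join('\n');
}
```

Printing the rendered script during synthesis, or reading it back from the launch template, is one way to confirm which values the instance actually receives.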

A sample redacted event payload from CloudTrail:

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "Unknown",
        "principalId": "",
        "accountId": "redacted",
        "userName": ""
    },
    "eventTime": "2023-11-17T07:13:52Z",
    "eventSource": "cloudformation.amazonaws.com",
    "eventName": "SignalResource",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "redacted",
    "userAgent": "CfnTools/2.0-23 (Linux-6.1.61-85.141.amzn2023.x86_64-x86_64-with-glibc2.34) python/3.9.16",
    "requestParameters": {
        "uniqueId": "i-redacted",
        "status": "SUCCESS",
        "logicalResourceId": "redacted2B222222",
        "stackName": "dev-examplestackNestedCD33333E-1J44444444444"
    },
    "responseElements": null,
    "requestID": "redacted",
    "eventID": "redacted",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "redacted",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "cloudformation.us-east-1.amazonaws.com"
    }
}
  • There's a 5-minute timeout set on the signal. Is the bootstrapping taking longer than that? I see a timestamp in the CloudTrail output; can you compare it with the rollback time in the stack events?

  • The timeout was reduced to 5 minutes during our investigation; it was originally set to 10 minutes. The cfn-signal call is sent within 2 minutes of the instance transitioning to the RUNNING state.

  • When you look in the CFN Stack Events, does it show the ASG timed out from something like "no signals received, treating as failure due to MinSuccessfulInstances"? I just replicated this in a normal CFN template (no CDK), and it worked, so it doesn't seem like a bug. Do you have a support plan on this account to be able to open a case for someone to review your resources?

    The only thing I can think of at this point is maybe the wrong logical_ID or nestedStack names are getting passed in?
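
The timing comparison suggested in the comments above can be sketched as a small check. This is an illustration under stated assumptions, not a diagnostic tool: `parseTimeoutMs` and `signalArrivedInTime` are hypothetical helper names, and the sketch assumes the timeout window starts at the timestamp of the Auto Scaling group's CREATE_IN_PROGRESS stack event, which you would copy from the stack events by hand.

```typescript
// Parses the subset of ISO 8601 durations typically used for CreationPolicy
// timeouts, such as "PT5M", "PT1H30M" or "PT45S". Returns milliseconds.
function parseTimeoutMs(duration: string): number {
  const match = /^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$/.exec(duration);
  if (match === null || duration === 'PT') {
    throw new Error('Unsupported duration: ' + duration);
  }
  const hours = Number(match[1] || '0');
  const minutes = Number(match[2] || '0');
  const seconds = Number(match[3] || '0');
  return ((hours * 60 + minutes) * 60 + seconds) * 1000;
}

// Returns true when the signal timestamp falls inside the timeout window.
// createStartedAt: time of the ASG's CREATE_IN_PROGRESS event (assumption:
//                  this is when the window opens).
// signalAt:        "eventTime" of the SignalResource event in CloudTrail.
function signalArrivedInTime(
  createStartedAt: string,
  signalAt: string,
  timeout: string,
): boolean {
  const started = Date.parse(createStartedAt);
  const signaled = Date.parse(signalAt);
  if (isNaN(started) || isNaN(signaled)) {
    throw new Error('Timestamps must be ISO 8601 strings');
  }
  const elapsed = signaled - started;
  return elapsed >= 0 && elapsed <= parseTimeoutMs(timeout);
}
```

A `false` result for a signal that precedes the start time is deliberate: a signal sent before the resource enters CREATE_IN_PROGRESS is treated here as outside the window.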

1 Answer

Are you sure you are signaling the ASG and not the instance itself? From the above, it isn't clear what $RESOURCE_LOGICAL_ID is populated with.

/usr/bin/cfn-signal -e $? --stack $STACK_NAME --resource $RESOURCE_LOGICAL_ID --region $AWS_REGION

    "requestParameters": {
        "uniqueId": "i-redacted",
        "status": "SUCCESS",
        "logicalResourceId": "redacted2B222222",
        "stackName": "dev-examplestackNestedCD33333E-1J44444444444"
    },
EXPERT
Kallu
answered 5 months ago
  • @Kallu I am signaling the ASG's logical ID. Trying to signal the instance itself results in an error:

    ValidationError: Resource i-00000000000000000 does not exist for stack example-stack-name
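
To rule out a mismatch between what the instance signals and what the stack expects, the `requestParameters` of the CloudTrail event can be compared field by field against the values taken from the synthesized template. The sketch below is only an illustration: `findSignalMismatches` and the `ExpectedTarget` interface are hypothetical names, and it assumes the signal uses the plain stack name rather than a stack ID, so an exact string comparison is meaningful.

```typescript
// Shape of "requestParameters" in the SignalResource event shown in the question.
interface SignalRequestParameters {
  uniqueId: string;
  status: string;
  logicalResourceId: string;
  stackName: string;
}

// Values you expect, read from the nested stack and its synthesized template.
interface ExpectedTarget {
  stackName: string;         // name of the nested stack that owns the ASG
  logicalResourceId: string; // logical ID of the ASG in that template
}

// Returns a human-readable list of differences; an empty list means the
// signal targeted the expected resource and reported success.
function findSignalMismatches(
  actual: SignalRequestParameters,
  expected: ExpectedTarget,
): string[] {
  const problems: string[] = [];
  if (actual.stackName !== expected.stackName) {
    problems.push(
      'stackName: got "' + actual.stackName + '", expected "' + expected.stackName + '"',
    );
  }
  if (actual.logicalResourceId !== expected.logicalResourceId) {
    problems.push(
      'logicalResourceId: got "' + actual.logicalResourceId +
        '", expected "' + expected.logicalResourceId + '"',
    );
  }
  if (actual.status !== 'SUCCESS') {
    problems.push('status: got "' + actual.status + '", expected "SUCCESS"');
  }
  return problems;
}
```

An empty result only shows that the identifiers line up; it does not by itself explain why CloudFormation might still ignore the signal.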
    
