Trouble with cfn-signal: CloudFormation Stack Not Progressing Despite Successful Signal Events

0

I'm facing an issue with my CloudFormation stack. I have an EC2 autoscaling group in a nested stack with an creation policy defined. Despite successful signal events being sent (they are present in CloudTrail and have the correct stack name and logical resource ID of the autoscaling group) the stack never progresses to a CREATE_COMPLETE state and is ultimately rolled back due to no success signals being received. Any ideas?

Initialization of the ASG:

const eni = new CfnNetworkInterface(...)
const instanceProfile = new InstanceProfile(...)

const asg = new CfnAutoScalingGroup(this, 'autoscaling-group', {
    availabilityZones: [props.availabilityZone],
    autoScalingGroupName: `${props.env}-${props.app}-${props.hostname}-autoscaling-group`,
    maxSize: '1',
    minSize: '0',
    desiredCapacity: '1',
    defaultInstanceWarmup: 60,
})

const launchTemplate = new CfnLaunchTemplate(this, 'launch-template', {
    launchTemplateData: {
        instanceType: props.instanceType.toString(),
        keyName: props.keyName,
        imageId: props.machineImage.getImage(this).imageId,
        iamInstanceProfile: {
            arn: instanceProfile.instanceProfileArn,
        },
        networkInterfaces: [{
            deviceIndex: 0,
            networkInterfaceId: eni.ref.toString(),
        }],
        userData: Fn.base64(instanceUserData({
                env: props.env,
                class: props.class,
                hostname: props.hostname,
                stackName: Stack.of(this).stackName,
                asgLogicalId: asg.logicalId,
            }),
        ),
    }
})

launchTemplate.addDependency(eni)
asg.addDependency(launchTemplate)

asg.addPropertyOverride('LaunchTemplate', {
    LaunchTemplateId: launchTemplate.ref,
    Version: launchTemplate.attrLatestVersionNumber.toString(),
})

asg.cfnOptions.creationPolicy = {
    autoScalingCreationPolicy: {
        minSuccessfulInstancesPercent: 100,
    },
    resourceSignal: {
        count: 1,
        timeout: 'PT5M',
    }
}

The user data script has the following call to cfn-signal:

/usr/bin/cfn-signal -e $? --stack $STACK_NAME --resource $RESOURCE_LOGICAL_ID --region $AWS_REGION

A sample redacted event payload from CloudTrail:

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "Unknown",
        "principalId": "",
        "accountId": "redacted",
        "userName": ""
    },
    "eventTime": "2023-11-17T07:13:52Z",
    "eventSource": "cloudformation.amazonaws.com",
    "eventName": "SignalResource",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "redacted",
    "userAgent": "CfnTools/2.0-23 (Linux-6.1.61-85.141.amzn2023.x86_64-x86_64-with-glibc2.34) python/3.9.16",
    "requestParameters": {
        "uniqueId": "i-redacted",
        "status": "SUCCESS",
        "logicalResourceId": "redacted2B222222",
        "stackName": "dev-examplestackNestedCD33333E-1J44444444444"
    },
    "responseElements": null,
    "requestID": "redacted",
    "eventID": "redacted",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "redacted",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "cloudformation.us-east-1.amazonaws.com"
    }
}
  • There's a 5 minute timeout set on the signal, is the bootstrapping taking longer then that? I see a timestamp in the cloudtrail output, can you compare that with the rollback time in the stack events?

  • The timeout was adjusted to 5 minutes during our investigation. The original timeout was set for 10 minutes -- the cfn-signal is sent within 2 minutes of the instance transitioning to the RUNNING state.

  • When you look in the CFN Stack Events, does it show the ASG timed out from something like "no signals received, treating as failure due to MinSuccessfulInstances"? I just replicated this in a normal CFN template (no CDK), and it worked, so it doesn't seem like a bug. Do you have a support plan on this account to be able to open a case for someone to review your resources?

    The only thing I can think of at this point is maybe the wrong logical_ID or nestedStack names are getting passed in?

1개 답변
1

Are you sure you signal ASG and not the instance itself? From above it isn't clear what gets populated to $RESOURCE_LOGICAL_ID.

/usr/bin/cfn-signal -e $? --stack $STACK_NAME --resource $RESOURCE_LOGICAL_ID --region $AWS_REGION
    "requestParameters": {
        "uniqueId": "i-redacted",
        "status": "SUCCESS",
        "logicalResourceId": "redacted2B222222",
        "stackName": "dev-examplestackNestedCD33333E-1J44444444444"
    },
profile picture
전문가
Kallu
답변함 6달 전
  • @Kallu I am signaling the ASG's logical ID. Trying to signal the instance itself results in an error:

    ValidationError: Resource i-00000000000000000 does not exist for stack example-stack-name
    

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠