CloudFormation deployment failing due to "Internal Error"


CONTEXT

I've been having trouble deploying my stack from a CFN template. This is a fairly large stack used by an internal team at AWS. There are multiple Glue jobs, S3 buckets, and IAM roles defined in the template.

Deploying the stack is very slow and needs to be done internally. I'm trying to understand why the deployment fails in a specific region (Malaysia). The stack currently has status "UPDATE_ROLLBACK_COMPLETE", and the "Detect Root Cause" in the "Events Update" tab brought up the following row:

Timestamp: 2024-11-15 12:53:07 UTC+0200
Logical ID: OneOfMyGlueJobsDefinedInTheTemplate
Status: CREATE_FAILED (Likely root cause)
Status reason: Internal Failure

I looked at the template for OneOfMyGlueJobsDefinedInTheTemplate. It states that the job has a script location in an S3 bucket, which looks something like this:

        ScriptLocation:
          Fn::Sub: [
            "s3://${S3Bucket}/${S3Key}/MyBigCFNProject/glue_jobs/S3bucketName/my_script.py",
            {
              S3Bucket: BATS::SAM::CodeS3Bucket,
              S3Key: BATS::SAM::AdditionalArtifactsS3Path
            }
          ] 

Digging for more information, I found https://repost.aws/knowledge-center/cloudformation-internal-failure-error, which led me to CloudTrail events. I downloaded the entire CSV file of events for the day the deployment failed and looked for error descriptions that would pinpoint the error. Nothing in it appears to show the reason for the failure.

I'm not sure whether the problem is the S3 bucket specified in OneOfMyGlueJobsDefinedInTheTemplate. Repeated deployments are tricky because our internal pipeline deploys in waves across multiple regions, and the failures occur specifically in the Malaysia region.

I'm considering forking my project and actually deploying with my own developer account, just to iterate faster.

So here's my question:

Are there more efficient ways to find out exactly why this deployment is failing? Maybe hidden logs somewhere, or a validation tool I can use before deploying the stack to detect problems early?

(Note: I've gone through the "relevant content" that pops up when typing out the question. Nothing was particularly helpful except a recommendation to contact AWS Support to get to the root of the issue.)

1 Answer

There are indeed several more efficient ways to troubleshoot and find out why your CloudFormation deployment is failing. Here are some approaches you can consider:

  1. Use the "Detect root cause" feature: You've already used this, which is a good start. It helps identify the likely root cause of the failure by adding a "Likely root cause" label to the specific event status.

  2. Examine CloudTrail events: You've also done this, which is another good step. Sometimes, the CloudTrail events can provide more detailed information about the failure.

  3. Review the stack events in detail: Check the CloudFormation stack events, and the matching CloudTrail entries, for any reference to AWS Config or Systems Manager issues. These services can sometimes conflict with CloudFormation deployments.

  4. Check for resource constraints: Ensure that you haven't reached the maximum number of allowed resources in your target account. For example, you might have hit the limit of allowed IAM roles.

  5. Validate your template: Before deployment, use the AWS CloudFormation Linter (cfn-lint) to validate your template. This can help catch errors before you attempt to deploy.

  6. Check permissions: Ensure that you have sufficient permissions in the target account to create all the resources specified in your template.

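  7. Verify global resource uniqueness: If your template creates resources with fixed names, check for collisions. S3 bucket names must be unique across all AWS accounts and Regions, and IAM role names must be unique within the account, so a hard-coded name that works in the first Region can fail when the same template is deployed to another one.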

  8. Use the deployment timeline view: This new feature in CloudFormation provides a visualization of the sequence of actions during a stack operation. It can help you understand the exact order of resource provisioning and identify where things are going wrong.

  9. Check for dependencies: Ensure that all necessary dependencies are properly defined in your template. You might need to add DependsOn attributes to some resources.

  10. Verify the script location: Double-check that the S3 bucket and key specified for your Glue job script actually exist and are accessible.

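  11. Consider regional differences: Since the failure is specific to the Malaysia region, investigate whether there are region-specific constraints that might be causing the issue, for example a service feature, resource property, or artifact bucket that is available in your other regions but not in this one.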

  12. Use AWS Config: This service can help you assess, audit, and evaluate the configurations of your AWS resources, which might reveal issues affecting your deployment.
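
For steps 1 and 8, the same information shown in the console can be pulled from the stack events and filtered in a script, which may be quicker than clicking through a large stack. The following is only a minimal sketch: the function names are made up for illustration, it assumes boto3 is installed and credentials for the target account are configured, and it treats any `*_FAILED` event whose reason is not a cancellation message as a candidate root cause, which is a heuristic rather than what the console does internally.

```python
# Heuristic: events that failed only because another resource failed first
# usually carry one of these cancellation messages as their status reason.
CANCELLED_MARKERS = ("Resource creation cancelled", "Resource update cancelled")


def likely_root_causes(events):
    """Return failed, non-cancelled stack events, oldest first."""
    failed = []
    for event in events:
        status = event.get("ResourceStatus", "")
        reason = event.get("ResourceStatusReason") or ""
        if not status.endswith("_FAILED"):
            continue
        if any(marker in reason for marker in CANCELLED_MARKERS):
            continue
        failed.append(event)
    return sorted(failed, key=lambda e: e["Timestamp"])


def fetch_stack_events(stack_name, region):
    """Fetch all events for a stack (hypothetical helper, needs boto3)."""
    import boto3  # imported lazily so the filter above works without it

    cfn = boto3.client("cloudformation", region_name=region)
    paginator = cfn.get_paginator("describe_stack_events")
    events = []
    for page in paginator.paginate(StackName=stack_name):
        events.extend(page["StackEvents"])
    return events


def print_root_causes(stack_name, region):
    """Print one line per candidate root-cause event."""
    for event in likely_root_causes(fetch_stack_events(stack_name, region)):
        print(
            event["Timestamp"],
            event["LogicalResourceId"],
            event["ResourceStatus"],
            event.get("ResourceStatusReason"),
        )
```

Calling `print_root_causes("my-stack-name", "<region-code>")` with your own stack name and the region code of the failing region would list the candidates. Note that the stack events cover every operation on the stack, so older failures can show up as well; comparing timestamps against the failed deployment helps narrow it down.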

If these methods don't reveal the cause of the failure, your idea of deploying with your own developer account to iterate faster could be a good approach. This would allow you to test and debug more quickly without affecting the main deployment pipeline. However, be cautious to ensure your test environment closely mirrors the production environment to avoid introducing new variables.
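
To iterate faster on step 10 specifically, the script location can be checked before a deployment is even started. This is a sketch under assumptions: it expects the already-resolved `s3://` URI (after `Fn::Sub` substitution, so you would need to look up the actual bucket and key prefix used in the failing region), the helper names are hypothetical, and the existence check needs boto3 plus credentials that are allowed to call `HeadObject` on the object.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI must contain a bucket and a key: {uri!r}")
    return bucket, key


def script_exists(uri, region):
    """Return True if the object behind the URI can be reached (needs boto3)."""
    import boto3  # imported lazily so split_s3_uri works without it
    from botocore.exceptions import ClientError

    bucket, key = split_s3_uri(uri)
    s3 = boto3.client("s3", region_name=region)
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        # A 404 means the object is missing, a 403 usually means a
        # permissions problem on the bucket or the object.
        print(f"Cannot access {uri}: {err.response['Error']['Code']}")
        return False
```

Running `script_exists` against the resolved `ScriptLocation` for the failing region, with the same role the pipeline uses, would at least confirm or rule out a missing or inaccessible script.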

If the issue persists after trying these methods, contacting AWS Support might indeed be necessary to get to the root of the problem, especially given that this is an internal AWS project with a large and complex stack.
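
Before opening a support case, it can help to collect the failed API calls around the time of the deployment, since a CloudTrail record for a failed call includes `errorCode`, `errorMessage`, and `requestID` fields. The sketch below filters events returned by the CloudTrail `LookupEvents` API instead of a downloaded CSV. The function names and the default list of event sources are assumptions chosen to match the resource types in the question (Glue, S3, IAM), and the region code has to be filled in for the failing region.

```python
import json

# Assumed set of services worth checking for this stack; adjust as needed.
DEFAULT_SOURCES = ("glue.amazonaws.com", "s3.amazonaws.com", "iam.amazonaws.com")


def failed_calls(events, event_sources=DEFAULT_SOURCES):
    """Return a summary of CloudTrail events that recorded an error."""
    failures = []
    for event in events:
        # Each LookupEvents item carries the full record as a JSON string.
        record = json.loads(event["CloudTrailEvent"])
        if "errorCode" not in record:
            continue
        if record.get("eventSource") not in event_sources:
            continue
        failures.append(
            {
                "time": record.get("eventTime"),
                "source": record.get("eventSource"),
                "name": record.get("eventName"),
                "error": record.get("errorCode"),
                "message": record.get("errorMessage"),
                "request_id": record.get("requestID"),
            }
        )
    return failures


def lookup_events(region, start, end):
    """Fetch CloudTrail management events in a time window (needs boto3)."""
    import boto3  # imported lazily so failed_calls works without it

    cloudtrail = boto3.client("cloudtrail", region_name=region)
    paginator = cloudtrail.get_paginator("lookup_events")
    events = []
    for page in paginator.paginate(StartTime=start, EndTime=end):
        events.extend(page["Events"])
    return events
```

Passing a window of a few minutes around the `CREATE_FAILED` timestamp to `lookup_events` and feeding the result to `failed_calls` keeps the output small. If nothing with an `errorCode` shows up, that is itself useful information to include in the support case.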
Sources
Determine the cause of a stack failure - AWS CloudFormation
Troubleshooting CloudFormation - AWS CloudFormation
Troubleshoot CloudFormation module errors | AWS re:Post
Peek inside your AWS CloudFormation Deployments with timeline view | AWS DevOps & Developer Productivity Blog

answered 2 months ago
EXPERT
reviewed 2 months ago