Stuck in CREATE_FAILED state for WaitCondition in CloudFormation


I launched the stack for US East (N. Virginia) from the AWS Glue user guide (https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html) to work through the YAML template and get a better understanding of creating resources with AWS CloudFormation. I have been stuck on the CREATE_FAILED state for the WaitCondition resource. I read that using a WaitConditionHandle when creating an EC2 instance is not a best practice and that a CreationPolicy is better, so I removed the wait handle and edited the wait condition to include a creation policy. But even when I reduced the signal count to match the 4 resources the stack was expected to create, it still returned a failed result and rolled back all the resources. Is there something I'm doing wrong?
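Roughly, this is what my edited WaitCondition looked like (the count value reflects the 4 resources mentioned above):

    WaitCondition:
      Type: AWS::CloudFormation::WaitCondition
      CreationPolicy:
        ResourceSignal:
          Count: 4
          Timeout: PT20M

Below is the full template I launched: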

Parameters:
  InstanceType:
    Type: String
    Default: t3.small
    AllowedValues:
      - t3.micro
      - t3.small
      - t3.medium
      - t3.large
      - t3.xlarge
      - t3.2xlarge
      - m5.large
      - m5.xlarge
      - m5.2xlarge
      - m5.4xlarge
      - m5.8xlarge
      - m5.12xlarge
      - m5.16xlarge
      - m5.24xlarge
      - r5.large
      - r5.xlarge
      - r5.2xlarge
      - r5.4xlarge
      - r5.8xlarge
      - r5.12xlarge
      - r5.16xlarge
      - r5.24xlarge
    Description: Instance Type for EC2 instance which hosts Spark history server. Enter one of [t3.micro/small/medium/large/xlarge/2xlarge, m5.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge/24xlarge, r5.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge/24xlarge]. Default is t3.small.
  LatestAmiId:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Description: Latest AMI ID of Amazon Linux 2 for Spark history server instance. You can use the default value.
    Default: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: 'VPC ID for Spark history server instance. You can use a VPC in your account. Warning: Using default VPC with a default NACL is not recommended.'
    Default: ''
  SubnetId:
    Type: AWS::EC2::Subnet::Id
    Description: Subnet ID for Spark history server instance. You can use any of subnet in your VPC. You need to have network reachability from your client to the subnet. If you want to access via Internet, you would need to use a public subnet which has Internet gateway in the route table.
    Default: ''
  IpAddressRange:
    Type: String
    Description: 'IP address range that can be used to view the Spark UI. You should use a custom value if you want to restrict access from a specific IP address range. Warning: Using the IP address range of 0.0.0.0/0 would make Spark UI publicly accessible.'
    MinLength: 9
    MaxLength: 18
  HistoryServerPort:
    Type: Number
    Description: History Server Port for the Spark UI. You can use the default value.
    Default: 18080
    MinValue: 1150
    MaxValue: 65535
  EventLogDir:
    Type: String
    Description: 'Event Log Directory where Spark event logs are stored from the Glue job or dev endpoints. You must use s3a:// for the event logs path scheme (example: s3a://path_to_eventlog).'
    Default: s3a://path_to_eventlog
  SparkPackageLocation:
    Type: String
    Description: You can use the default value.
    Default: https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-without-hadoop.tgz
  KeystorePath:
    Type: String
    Description: SSL/TLS keystore path for HTTPS. If you want to use custom keystore file, you can specify the S3 path s3://path_to_your_keystore_file here. If you leave this parameter empty, self-signed certificate based keystore is used.
  KeystorePassword:
    Type: String
    NoEcho: true
    Description: SSL/TLS keystore password for HTTPS. A valid password can contain 6 to 30 characters.
    MinLength: 6
    MaxLength: 30

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Spark UI Configuration
        Parameters:
          - IpAddressRange
          - HistoryServerPort
          - EventLogDir
          - SparkPackageLocation
          - KeystorePath
          - KeystorePassword
      - Label:
          default: EC2 Instance Configuration
        Parameters:
          - InstanceType
          - LatestAmiId
          - VpcId
          - SubnetId

Mappings:
  MemoryBasedOnInstanceType:
    t3.micro:
      SparkDaemonMemory: 512m
    t3.small:
      SparkDaemonMemory: 1g
    t3.medium:
      SparkDaemonMemory: 3g
    t3.large:
      SparkDaemonMemory: 6g
    t3.xlarge:
      SparkDaemonMemory: 12g
    t3.2xlarge:
      SparkDaemonMemory: 28g
    m5.large:
      SparkDaemonMemory: 6g
    m5.xlarge:
      SparkDaemonMemory: 12g
    m5.2xlarge:
      SparkDaemonMemory: 28g
    m5.4xlarge:
      SparkDaemonMemory: 28g
    m5.8xlarge:
      SparkDaemonMemory: 28g
    m5.12xlarge:
      SparkDaemonMemory: 28g
    m5.16xlarge:
      SparkDaemonMemory: 28g
    m5.24xlarge:
      SparkDaemonMemory: 28g
    r5.large:
      SparkDaemonMemory: 12g
    r5.xlarge:
      SparkDaemonMemory: 28g
    r5.2xlarge:
      SparkDaemonMemory: 28g
    r5.4xlarge:
      SparkDaemonMemory: 28g
    r5.8xlarge:
      SparkDaemonMemory: 28g
    r5.12xlarge:
      SparkDaemonMemory: 28g
    r5.16xlarge:
      SparkDaemonMemory: 28g
    r5.24xlarge:
      SparkDaemonMemory: 28g

Resources:
  Imds2LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        MetadataOptions:
          HttpEndpoint: enabled
          HttpTokens: required
  HistoryServerInstance:
    Type: AWS::EC2::Instance
    Properties:
      LaunchTemplate:
        LaunchTemplateId: !Ref Imds2LaunchTemplate
        Version: !GetAtt Imds2LaunchTemplate.LatestVersionNumber
      ImageId: !Ref LatestAmiId
      InstanceType: !Ref InstanceType
      SubnetId: !Ref SubnetId
      SecurityGroupIds:
        - !Ref InstanceSecurityGroup
      IamInstanceProfile: !Ref HistoryServerInstanceProfile
      UserData: !Base64
        Fn::Sub: |
          #!/bin/bash -xe
          yum update -y aws-cfn-bootstrap
          echo "CA_OVERRIDE=/etc/pki/tls/certs/ca-bundle.crt" >> /etc/environment
          export CA_OVERRIDE=/etc/pki/tls/certs/ca-bundle.crt
          rpm -Uvh https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
          pip3 install requests
          /opt/aws/bin/cfn-init -v -s ${AWS::StackName} -r HistoryServerInstance --region ${AWS::Region}
          /opt/aws/bin/cfn-signal -e $? -s ${AWS::StackName} -r HistoryServerInstance --region ${AWS::Region}
    Metadata:
      AWS::CloudFormation::Init:
        configSets:
          default:
            - cloudwatch_agent_configure
            - cloudwatch_agent_restart
            - spark_download
            - spark_init
            - spark_configure
            - spark_hs_start
            - spark_hs_test
        cloudwatch_agent_configure:
          files:
            /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json:
              content: !Sub |
                {
                  "logs": {
                    "logs_collected": {
                      "files": {
                        "collect_list": [
                          {
                            "file_path": "/var/log/cfn-init.log",
                            "log_group_name": "/aws-glue/sparkui_cfn/cfn-init.log"
                          },
                          {
                            "file_path": "/opt/spark/logs/spark-*",
                            "log_group_name": "/aws-glue/sparkui_cfn/spark_history_server.log"
                          }
                        ]
                      }
                    }
                  }
                }
        cloudwatch_agent_restart:
          commands:
            01_stop_service:
              command: /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a stop
            02_start_service:
              command: /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s
        spark_download:
          packages:
            yum:
              java-1.8.0-openjdk: []
              maven: []
              python3: []
              python3-pip: []
          sources:
            /opt: !Ref SparkPackageLocation
          commands:
            create-symlink:
              command: ln -s /opt/spark-* /opt/spark
            export:
              command: !Sub |
                echo "export JAVA_HOME=/usr/lib/jvm/jre" | sudo tee -a /etc/profile.d/jdk.sh
                echo "export SPARK_HOME=/opt/spark" | sudo tee -a /etc/profile.d/spark.sh
                export JAVA_HOME=/usr/lib/jvm/jre
                export SPARK_HOME=/opt/spark
            download-pom-xml:
              command: curl -o /tmp/pom.xml https://aws-glue-sparkui-prod-us-east-1.s3.amazonaws.com/public/mvn/glue-4_0/pom.xml
            download-setup-py:
              command: curl -o /tmp/setup.py https://aws-glue-sparkui-prod-us-east-1.s3.amazonaws.com/public/misc/glue-4_0/setup.py
            download-systemd-file:
              command: curl -o /usr/lib/systemd/system/spark-history-server.service https://aws-glue-sparkui-prod-us-east-1.s3.amazonaws.com/public/misc/spark-history-server.service
        spark_init:
          commands:
            download-mvn-dependencies:
              command: cd /tmp; mvn dependency:copy-dependencies -DoutputDirectory=/opt/spark/jars/
            install-boto:
              command: pip3 install boto --user; pip3 install boto3 --user
          files:
            /opt/spark/conf/spark-defaults.conf:
              content: !Sub |
                spark.eventLog.enabled true
                spark.history.fs.logDirectory ${EventLogDir}
                spark.history.ui.port 0
                spark.ssl.historyServer.enabled true
                spark.ssl.historyServer.port ${HistoryServerPort}
                spark.ssl.historyServer.keyStorePassword ${KeystorePassword}
              group: ec2-user
              mode: '000644'
              owner: ec2-user
            /opt/spark/conf/spark-env.sh:
              content: !Sub
                - |
                  export SPARK_DAEMON_MEMORY=${SparkDaemonMemoryConfig}
                  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"
                - SparkDaemonMemoryConfig: !FindInMap
                    - MemoryBasedOnInstanceType
                    - !Ref InstanceType
                    - SparkDaemonMemory
              group: ec2-user
              mode: '000644'
              owner: ec2-user
        spark_configure:
          commands:
            create-symlink:
              command: ln -s /usr/lib/systemd/system/spark-history-server.service /etc/systemd/system/multi-user.target.wants/
            enable-spark-hs:
              command: systemctl enable spark-history-server
            configure-keystore:
              command: !Sub |
                python3 /tmp/setup.py --keystore "${KeystorePath}" --keystorepw "${KeystorePassword}" > /tmp/setup_py.log 2>&1
        spark_hs_start:
          commands:
            start_spark_hs_server:
              command: systemctl start spark-history-server
        spark_hs_test:
          commands:
            check-spark-hs-server:
              command: !Sub |
                curl --retry 60 --retry-delay 10 --retry-max-time 600 --retry-connrefused https://localhost:${HistoryServerPort} --insecure; /opt/aws/bin/cfn-signal -e $? "${WaitHandle}"
  WaitHandle:
    Type: AWS::CloudFormation::WaitConditionHandle
  WaitCondition:
    Type: AWS::CloudFormation::WaitCondition
    DependsOn: HistoryServerInstance
    Properties:
      Handle: !Ref WaitHandle
      Timeout: 1200
  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable HTTPS access
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: !Ref HistoryServerPort
          ToPort: !Ref HistoryServerPort
          CidrIp: !Ref IpAddressRange
  HistoryServerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - kms:Decrypt
                Resource: '*'
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
        - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
  HistoryServerInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: /
      Roles:
        - !Ref HistoryServerRole

Outputs:
  SparkUiPublicUrl:
    Description: The Public URL of Spark UI
    Value: !Join
      - ''
      - - https://
        - !GetAtt HistoryServerInstance.PublicDnsName
        - ':'
        - !Ref HistoryServerPort
  SparkUiPrivateUrl:
    Description: The Private URL of Spark UI
    Value: !Join
      - ''
      - - https://
        - !GetAtt HistoryServerInstance.PrivateDnsName
        - ':'
        - !Ref HistoryServerPort
  CloudWatchLogsCfnInit:
    Description: CloudWatch Logs Console URL for cfn-init.log in History Server Instance
    Value: !Join
      - ''
      - - https://console.aws.amazon.com/cloudwatch/home?region=
        - !Ref AWS::Region
        - '#logEventViewer:group=/aws-glue/sparkui_cfn/cfn-init.log;stream='
        - !Ref HistoryServerInstance
  CloudWatchLogsSparkHistoryServer:
    Description: CloudWatch Logs Console URL for spark history server logs in History Server Instance
    Value: !Join
      - ''
      - - https://console.aws.amazon.com/cloudwatch/home?region=
        - !Ref AWS::Region
        - '#logEventViewer:group=/aws-glue/sparkui_cfn/spark_history_server.log;stream='
        - !Ref HistoryServerInstance

asked 2 months ago
1 Answer

It appears you're encountering an issue with CloudFormation stack creation, specifically with the WaitCondition resource. The problem is most likely in how the WaitCondition is wired up in your template.

First, you're correct about the best practice: for Amazon EC2 and Auto Scaling resources, a CreationPolicy attribute is recommended over a WaitCondition. However, simply removing the WaitConditionHandle and adding a CreationPolicy to the WaitCondition itself isn't the correct approach, because the CreationPolicy belongs on the resource that sends the signal, which here is the EC2 instance.

The WaitCondition in your template is dependent on receiving a signal from the EC2 instance (HistoryServerInstance) that it has completed its setup. This signal is sent using the cfn-signal command in the UserData script of the EC2 instance.
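For reference, this is the signaling chain in the posted template: the final cfn-init config set sends the signal to the handle, and the WaitCondition blocks stack completion until that signal arrives (excerpted from your Resources section):

    # Last config set in AWS::CloudFormation::Init on the instance
    spark_hs_test:
      commands:
        check-spark-hs-server:
          command: !Sub |
            curl --retry 60 --retry-delay 10 --retry-max-time 600 --retry-connrefused https://localhost:${HistoryServerPort} --insecure; /opt/aws/bin/cfn-signal -e $? "${WaitHandle}"

    # Resources the signal flows through
    WaitHandle:
      Type: AWS::CloudFormation::WaitConditionHandle
    WaitCondition:
      Type: AWS::CloudFormation::WaitCondition
      DependsOn: HistoryServerInstance
      Properties:
        Handle: !Ref WaitHandle
        Timeout: 1200

Note the failure mode this creates: if any earlier config set fails, cfn-init stops before spark_hs_test runs, the handle is never signaled, and the WaitCondition times out into CREATE_FAILED.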

Here are a few things to check and modify in your template:

  1. Make sure the cfn-signal commands are consistent. With the WaitCondition approach, the signal must be sent to the WaitHandle (as spark_hs_test does), not to the WaitCondition itself or to an instance resource that has no CreationPolicy, which is what the cfn-signal in your UserData currently does.

  2. Instead of modifying the WaitCondition, you should add a CreationPolicy to the HistoryServerInstance resource. This CreationPolicy will replace the need for a separate WaitCondition. Here's how you can modify your HistoryServerInstance resource:

    HistoryServerInstance:
      Type: AWS::EC2::Instance
      CreationPolicy:
        ResourceSignal:
          Timeout: PT20M
      Properties:
        # ... other properties ...
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash -xe
            # ... other commands ...
            /opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource HistoryServerInstance --region ${AWS::Region}

  3. Remove the WaitHandle and WaitCondition resources from your template.

  4. Make sure all the resources that the HistoryServerInstance depends on are created before it. Use DependsOn if necessary.

  5. Check the CloudWatch logs for cfn-init.log to see if there are any errors during the instance setup. If you can get onto the instance before rollback, the log locations sketched below are the fastest way to see what failed.
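If you can reach the instance over SSH or SSM Session Manager before it is terminated, these are the standard Amazon Linux 2 log locations for diagnosing a failed cfn-init run (default paths used by aws-cfn-bootstrap and cloud-init):

    # Full cfn-init execution log: each config set, command, and its stderr
    sudo tail -n 100 /var/log/cfn-init.log
    sudo tail -n 100 /var/log/cfn-init-cmd.log

    # UserData/cloud-init output, in case the script failed before cfn-init ran
    sudo tail -n 100 /var/log/cloud-init-output.log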

If you make these changes and still encounter issues, investigate further by checking the EC2 instance logs or CloudFormation events for more detailed error messages. Remember: when CloudFormation receives the signal from cfn-signal, it continues with stack creation; if the signal doesn't arrive within the timeout period (20 minutes in the example above), it marks the resource creation as failed and rolls back the stack.
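If the stack keeps rolling back before you can look at anything, two AWS CLI steps help. The stack name, file name, and parameter value below are illustrative, and the remaining required parameters (KeystorePassword, VpcId, SubnetId, etc.) are omitted for brevity:

    # Create the stack with rollback disabled so a failed instance
    # survives long enough to inspect its logs
    aws cloudformation create-stack \
      --stack-name spark-history-server \
      --template-body file://sparkui.yaml \
      --capabilities CAPABILITY_IAM \
      --disable-rollback \
      --parameters ParameterKey=IpAddressRange,ParameterValue=203.0.113.0/24

    # Read the CREATE_FAILED reasons straight from the event stream
    aws cloudformation describe-stack-events \
      --stack-name spark-history-server \
      --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
      --output table

CAPABILITY_IAM is needed because the template creates an IAM role and instance profile.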
Sources
AWS::CloudFormation::WaitCondition - AWS CloudFormation
Create wait conditions in a CloudFormation template - AWS CloudFormation

answered 2 months ago
  • I made the modifications as suggested: added the creation policy to the EC2 instance and removed the WaitHandle and WaitCondition. The instance never got created and the stack was rolled back. Is the template given in the documentation guide outdated?
