Anatomy of efficient CloudFormation templates for large-scale automated testing, AIOps, MLOps, etc.

13 minute read
Content level: Advanced

This article details how we structure CloudFormation templates for improved parallelism, cost-efficiency, security and performance.

1. Introduction

In my current ML/GenAI work on automatic test code generation with LLMs for AWS Mainframe Modernization, we need to deploy (and delete after use) many application environments so that LLMs and other models can learn from test-execution feedback obtained on fresh, unbiased environments. We therefore make heavy use of AWS CloudFormation (CFN). It lets us fully automate our MLOps / AIOps workflows with repeatable quality while remaining cost-efficient.

In this article, we detail some of our best practices for meeting those objectives.

2. Rationale for CloudFormation and Infrastructure as Code

We use CloudFormation templates to restore the exact same initial state of AWS resources and application data each time we replay a test scenario. This is required to compare the results of successive iterations of our models. All the resources that the CloudFormation engine creates together from the template are grouped as a resource stack.

CFN brings all the advantages of Infrastructure-as-Code:

  • quality by repeatability: once your CFN template is valid, you can be confident that each time you recreate your stack, CloudFormation delivers the same configuration for your tests as in the previous iteration.
  • scalability: thanks to the elasticity of the AWS cloud, you can create as many resource stacks as you need in parallel from the same template to run your various tests concurrently and accelerate your project. The only limitation may come from the resource quotas assigned to your account. AWS Support can increase them as needed.
  • automation: CFN provides a full set of APIs to create, delete and manage a stack. With this “programmatic infrastructure”, you can automatically create and manage very large-scale architectures once you have written scripts that leverage those APIs.
  • costs: the resources of a single template are created in parallel wherever their dependencies allow. They are also deleted in parallel right after the end of the test. With the pay-as-you-use model of the AWS Cloud, where cost depends on duration, this workflow minimizes the resulting costs.
  • speed: the CloudFormation engine is optimized to create and destroy all required resources in the right order. It is consistently faster than a person doing the same work manually, because the engine knows the order of creation and deletion that yields the shortest time.

Application test scenarios may process business-confidential data. So, the CFN templates that set up their initial conditions must apply security best practices suited to such a use case.

The following diagram shows the architecture of the sample CloudFormation template for a basic single-instance AWS Mainframe Modernization environment. You can download this CloudFormation template from here. We focus on YAML syntax because it is more concise and readable than JSON.

If you adapt this template for your specific use case, especially if you need to distribute it to customers or partners, make sure to follow the AWS best practices for CFN templates.

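[Diagram: architecture of the sample CloudFormation template for a basic single-instance AWS Mainframe Modernization environment]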

The following sections contain details about how the template is structured so that you can understand how the various parts relate to each other.

2. Best practices for CFN templates

In our use case, we must run several instances of the same CFN stack simultaneously to create identical initial conditions. Running in parallel like this accelerates our MLOps cycles. In doing so, our workflow leverages cloud elasticity (as many resources as you need at any given moment) while remaining as frugal as possible (resources are used for the minimal amount of time and deleted as soon as they are no longer needed).

a) Resource naming: meaningful resource names (i.e. not those automatically generated by the system) raise the productivity of developers and sysops when they navigate across resources for various activities. Many types of AWS resources that are part of a CloudFormation stack require names that are unique within their service (for example, no two Amazon RDS databases may have the same name). So, our reference template defines an SSM parameter to create a unique suffix (see the AWS::SSM::Parameter with logical name UniqueSuffix in the following example). This SSM parameter is based on a chunk of the CFN pseudo-parameter AWS::StackId (see Pseudo parameters reference in the CloudFormation User Guide). In the CFN template, we append this unique suffix to a meaningful resource name wherever uniqueness is required. The suffix is common to all resources of the same stack, which allows a developer to easily “assemble” the pieces that belong together. See the M2Name parameter in the following example.

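  # AWS::StackId has the form arn:aws:cloudformation:<region>:<account>:stack/<stack-name>/<uuid>.
  # Splitting on '/' and selecting index 2 yields the UUID; its first '-'-separated chunk is the suffix.
  UniqueSuffix:
    Type: AWS::SSM::Parameter
    DeletionPolicy: Delete
    Properties:
      Type: 'String'
      Value: !Select [0, !Split ['-', !Select [2, !Split ['/', !Ref 'AWS::StackId']]]]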

  M2Name:
    Type: AWS::SSM::Parameter
    DeletionPolicy: Delete
    Properties:
      Type: 'String'
      Value: !Join
        - '-'
        -  - !Ref Label
           - !GetAtt UniqueSuffix.Value
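
As an illustration of this naming convention, a resource whose name must be unique in the account can embed the suffix through Sub. The fragment below is a minimal sketch and not an extract of the reference template: the 'm2-db-secret' prefix, the 'dbadmin' user name and the password settings are assumptions made for the example. Only UniqueSuffix and M2Name come from the snippet above.

```yaml
  M2DbSecret:
    Type: AWS::SecretsManager::Secret
    DeletionPolicy: Delete
    Properties:
      # ${UniqueSuffix.Value} is the Sub equivalent of !GetAtt UniqueSuffix.Value
      Name: !Sub 'm2-db-secret-${UniqueSuffix.Value}'   # assumed naming prefix
      Description: !Sub 'database credentials for ${M2Name.Value}'
      GenerateSecretString:
        SecretStringTemplate: '{"username": "dbadmin"}' # assumed user name
        GenerateStringKey: 'password'
        PasswordLength: 32
        ExcludePunctuation: true
```

Two stacks launched from the same template then get two different secret names, while every resource of a given stack carries the same suffix.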

b) Full-stack definitions: The complete template defines a full stack of resources from an M2 Application to the Aurora/RDS database that the application uses to store application data. It also includes low-level networking components (VPC, private subnets, internet gateway, associated security groups and so on) to create a fully autonomous and isolated application stack of resources.

We chose this architecture for the following reasons:

  • Resilience: we could have defined a CFN stack that relied on the default VPC and subnets that AWS defines automatically for each region in which you can work. However, the default VPC and subnets are very probably also used by several other components that run in parallel in your account. So, if we needed to update the configuration of those default networking components for our use case, we might break other running, productive use cases. Conversely, if others updated those default resources for another use case, they might break the configuration required for our AI activities.
  • Confidentiality: with a dedicated VPC, the communication between the M2 application and the RDS database happens on a dedicated subnet. This prevents other non-related programs from snooping, which is possible if we use a shared subnet for this communication.
  • Security: the security groups, the Secrets Manager secrets for database credentials and the KMS keys are dedicated to this stack. So, security rules are not shared as they would be if they were carried by the security groups associated with the default VPC. Therefore, we can ensure that the security groups are tightly restricted to the specific test scenario being run. Shared security groups could require broader permissions to allow access for additional protocols needed by other parallel use cases. This requirement would decrease the security of the application under test and of the corresponding business data, which is potentially confidential. Similarly, secrets for database credentials and the KMS keys that encrypt them are not shared with any other use case within the account, which results in greater security for AppTest and for those other use cases.
  • Performance: all resources in the stack communicate with each other in complete isolation. This design choice prevents, for example, the network traffic between application and database being slowed down by network traffic, which could be heavy, between other resources in a shared VPC.

3. Template sections and highlights

This section describes the most important and closely related characteristics of the sample template. For more details regarding the structure of a CFN template and details about the content of the different sections, see Template anatomy in the CloudFormation User Guide.

3.1 Outputs section

The final Outputs section of a template is essential in our use case. M2 applications require the following two outputs: M2EnvironmentId and M2ApplicationId. These outputs give us the IDs of those two resources so that our tests can interact with them through the Mainframe Modernization APIs. You must define them using the intrinsic function GetAtt on the following return values:

  • [EnvironmentId](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-m2-environment.html), returned by AWS::M2::Environment.
  • [ApplicationId](https://docs.aws.amazon.com/en_us/AWSCloudFormation/latest/UserGuide/aws-resource-m2-application.html), returned by AWS::M2::Application.

The following example shows how to define these using GetAtt.

  M2EnvironmentId:
    Description: 'm2 environment id'
    Value: !GetAtt M2Env.EnvironmentId

  M2ApplicationId:
    Description: 'm2 application id'
    Value: !GetAtt M2App.ApplicationId

If your application requires an initial data import, as our OSS CardDemo sample application does, you must add a third output. This output exposes the S3 location of the JSON file that contains all definitions and parameters (location, structure, etc.) of the initial data import. Our AI tests subsequently call the M2 dataset import API with this S3 location of the JSON file as a parameter.

  M2ImportJson:
    Description: 's3 location of import json'
    Value: !Ref ImportJsonS3Location

This output simply and directly references the parameter ImportJsonS3Location, which contains the effective value for the S3 location.

  ImportJsonS3Location:
    Description: 's3 location of import definitions'
    Default: 's3://aws-m2-math-artefacts/mf/card-demo/mf-carddemo-datasets-import.json'
    Type: String

For other advanced use cases, you could construct this output in a more sophisticated manner, such as depending on some dynamically created bucket or other mechanisms.
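
As one possible shape for such an advanced case, the output can be assembled from a bucket that the stack creates itself. This is only a sketch under stated assumptions: the ImportBucket resource and the ImportJsonKey parameter are hypothetical names that do not exist in the sample template.

```yaml
Parameters:
  ImportJsonKey:                 # hypothetical parameter: object key of the import definition
    Description: 'key of the import definitions file in the bucket'
    Type: String
    Default: 'datasets-import.json'

Resources:
  ImportBucket:                  # hypothetical bucket created by the stack itself
    Type: AWS::S3::Bucket
    DeletionPolicy: Delete

Outputs:
  M2ImportJson:
    Description: 's3 location of import json'
    # Ref on an AWS::S3::Bucket returns the bucket name
    Value: !Sub 's3://${ImportBucket}/${ImportJsonKey}'
```

Note that CloudFormation can delete a bucket only when it is empty, so a workflow built on this sketch would have to remove the uploaded objects before deleting the stack.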

3.2 Parameters section

In the first section of the template, Parameters, we try to define all the values that users might want to change in the template to adapt it to their own contexts. The goal is to avoid changing any value in the Resources section when customization is needed. The Parameters section of the template makes it easy to locate where changes were made and to avoid inappropriate changes in the Resources section. To leverage this practice to its maximum, we could also have defined the Aurora database parameters (EngineVersion, MinCapacity, MaxCapacity) in this section.

Another advantage of the Parameters section is that customers can change the value of the parameters interactively in the CloudFormation console (or programmatically via scripts) when they are launching the stack manually for testing purposes.
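
To make this concrete, here is a hedged sketch of what such a Parameters section could look like. Label and Tn3270Port are used elsewhere in the sample template, and EngineVersion, MinCapacity and MaxCapacity are the Aurora values mentioned above. All descriptions, defaults and constraints shown here are assumptions for illustration, not the values of the reference template.

```yaml
Parameters:
  Label:
    Description: 'meaningful prefix for resource names'
    Type: String
    Default: 'carddemo'                  # assumed default
    AllowedPattern: '^[a-z][a-z0-9-]*$'  # assumed constraint: lowercase, digits, hyphens
    MaxLength: 20
  Tn3270Port:
    Description: 'tcp port of the tn3270 listener'
    Type: Number
    Default: 6000                        # assumed default
    MinValue: 1024
    MaxValue: 65535
  EngineVersion:
    Description: 'aurora engine version'
    Type: String
  MinCapacity:
    Description: 'minimum aurora serverless capacity'
    Type: Number
  MaxCapacity:
    Description: 'maximum aurora serverless capacity'
    Type: Number
```

Constraints such as AllowedPattern, MinValue and MaxValue let CloudFormation reject an invalid value before any resource is created, which is cheaper than a failed stack.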

3.3 Application definition

We define the M2 application definition as an SSM parameter whose value is the long JSON string of the definition. With this choice, the definition can be displayed in the Outputs section of the stack once all variable substitution is complete (see Outputs section). It also delegates to CloudFormation the task of parsing the value and replacing every variable in it with the parameter values and CFN references that correspond to this specific instantiation of the stack.

M2AppDef:
    Type: AWS::SSM::Parameter
    DeletionPolicy: Delete
    Properties:
      Type: 'String'
      Value: !Sub | 
        {
          "template-version": "2.0",
          "source-locations": [
            {
              "source-id": "s3-source",
              "source-type": "s3",
              "properties": {
                "s3-bucket": "${BucketName}",
                "s3-key-prefix": "${AppKey}"
              }
            }
          ],
          
          ...
          
          "xa-resources": [
              {
                "name": "XASQL",
                "secret-manager-arn": "${M2DbSecret}",
                "module": "${!s3-source}/xa/ESPGSQLXA64.so"
              }

The previous example shows two distinct use cases.

  • ${BucketName} and ${AppKey} are replaced by the value of the corresponding parameter from the Parameters section.
  • ${M2DbSecret} is replaced by the ARN of the secret that the M2DbSecret resource of the template creates dynamically. This ARN is the return value of a reference (Ref) to the resource with the logical ID M2DbSecret.

Two highlights about the variables referencing another resource from within string substitution:

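  • They can return an attribute of the referenced resource: in the following example, ${M2DbName.Value} is equivalent to applying the CFN intrinsic function GetAtt to the resource M2DbName, of type AWS::SSM::Parameter. In this case, it selects the return value specifically called Value.
            "dataset-location": {
              "db-locations": [
                {
                  "name": "${M2DbName.Value}",
                  "secret-manager-arn": "${M2DbSecret}"
                }
              ]
            }
  • They can be escaped when a variable belongs to the application definition rather than to CloudFormation: ${!s3-source}, written with an exclamation mark after the opening brace, is left by the substitution as the literal text ${s3-source}. The M2 service then resolves it against the source-id declared in source-locations. In the following example, ${LoadlibKey}, ${RdefKey} and ${SitKey} are still replaced by CloudFormation.
            "cics-settings": {
              "binary-file-location": "${!s3-source}/${LoadlibKey}",
              "csd-file-location": "${!s3-source}/${RdefKey}",
              "system-initialization-table": "${SitKey}"
            },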
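
The substitution forms discussed in this section can be put side by side in one small sketch, which also shows the two-argument form of Sub with a local variable map. The M2SubDemo resource, the JSON keys and the DbEndpoint variable are hypothetical and serve only the illustration. The sketch also assumes that M2DbCluster is an AWS::RDS::DBCluster, whose Endpoint.Address attribute is then available through GetAtt.

```yaml
  M2SubDemo:                     # hypothetical resource, for illustration only
    Type: AWS::SSM::Parameter
    DeletionPolicy: Delete
    Properties:
      Type: 'String'
      Value: !Sub
        # first argument: the string to process
        - |
          {
            "from-parameter": "${BucketName}",
            "from-ref": "${M2DbSecret}",
            "from-getatt": "${M2DbName.Value}",
            "from-local-map": "${DbEndpoint}",
            "left-literal": "${!s3-source}/xa/ESPGSQLXA64.so"
          }
        # second argument: variables that exist only inside this Sub call
        - DbEndpoint: !GetAtt M2DbCluster.Endpoint.Address
```

The local map is handy when a value needs another intrinsic function to compute it, since function calls cannot be written inside the ${...} placeholders themselves.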

3.4 Inter-resource dependencies

CloudFormation tries to minimize the time needed for the stack to reach CREATE_COMPLETE status. This status means that all resources created by the template are ready for use. Consequently, the CloudFormation engine launches all possible resource creations in parallel.

However, CloudFormation provides a mechanism - the DependsOn attribute - to create explicit dependencies between resources and orchestrate their creation sequence when they depend on each other. For example, the M2App resource can’t be created until the M2DbCluster resource is created, as shown in the following example:

M2App:
    Type: AWS::M2::Application
    DeletionPolicy: Delete
    DependsOn: M2DbCluster
    Properties:
      Name: !GetAtt M2Name.Value
      Description: !Join
        - ' '
        -  - 'm2 application:'
           - !GetAtt M2Name.Value
      EngineType: !Ref EngineType
      Definition:
        Content: !GetAtt M2AppDef.Value
      Tags:
        'app-name': !GetAtt M2Name.Value

If a dependency like this isn’t declared, CloudFormation might start creating the application before the database exists, and the creation then fails because the database it relies on is missing. The DependsOn attribute guarantees that application creation starts only after the M2DbCluster resource has been created.
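
DependsOn is only needed when the template contains no reference between the two resources. Whenever a resource uses Ref or GetAtt on another one, CloudFormation infers the ordering by itself. The following sketch illustrates this with the M2 environment. It is an assumption-laden illustration: the InstanceType parameter and the M2SubnetA and M2SubnetB resources are hypothetical names, and the property values are not taken from the reference template.

```yaml
  M2Env:
    Type: AWS::M2::Environment
    DeletionPolicy: Delete
    # No DependsOn here: the !Ref and !GetAtt calls below already tell
    # CloudFormation to create the security group, the subnets and M2Name first.
    Properties:
      Name: !GetAtt M2Name.Value
      EngineType: !Ref EngineType
      InstanceType: !Ref InstanceType    # hypothetical parameter
      PubliclyAccessible: true
      SecurityGroupIds:
        - !Ref M2VpcSecGroup
      SubnetIds:
        - !Ref M2SubnetA                 # hypothetical subnet resources
        - !Ref M2SubnetB
```

Keeping DependsOn for the cases where no such reference exists avoids serializing resources that could otherwise be created in parallel.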

3.5 Application database

Here we create a fresh, empty Aurora Serverless database because the initial data load is implemented by importing data sets. However, it would also be possible to restore a snapshot that you created previously using the AWS CLI command [aws rds create-db-snapshot](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/rds/create-db-snapshot.html) (or [aws rds create-db-cluster-snapshot](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/rds/create-db-cluster-snapshot.html)).

The CFN template must then pass the identifier of the snapshot to restore, for example through a DbSnapshotIdentifier parameter, to the SnapshotIdentifier property of AWS::RDS::DBCluster (or to the DBSnapshotIdentifier property of AWS::RDS::DBInstance).
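
A restore from a cluster snapshot could be sketched as follows. This is not the database definition of the reference template: the engine name is an assumption, M2DbSubnetGroup is a hypothetical AWS::RDS::DBSubnetGroup, and the capacity and instance settings that a real Aurora cluster needs are left out.

```yaml
Parameters:
  DbSnapshotIdentifier:
    Description: 'identifier or ARN of the cluster snapshot to restore'
    Type: String

Resources:
  M2DbCluster:
    Type: AWS::RDS::DBCluster
    # The default DeletionPolicy for AWS::RDS::DBCluster is Snapshot;
    # Delete avoids leaving a final snapshot behind after every test run.
    DeletionPolicy: Delete
    Properties:
      Engine: aurora-postgresql                  # assumed engine
      SnapshotIdentifier: !Ref DbSnapshotIdentifier
      # MasterUsername and MasterUserPassword are deliberately omitted:
      # they must not be specified when restoring from a snapshot.
      DBSubnetGroupName: !Ref M2DbSubnetGroup    # hypothetical subnet group
      VpcSecurityGroupIds:
        - !Ref M2VpcSecGroup
```

Because every test run then starts from the same snapshot, this variant preserves the repeatability goal while skipping the data set import step.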

You can also adapt this template to start a classical provisioned RDS instance / cluster if needed: see this guide to select the right one for your use case.

3.6 Network resources

The CFN template creates its own independent network resources: VPC, subnets, internet gateway, route table, and so on. As explained under Full-stack definitions in the best practices section above, doing so increases resiliency, security, confidentiality, and performance.

The VPC address block is chosen from among those ranges recommended for private subnets in RFC1918. Choices in those ranges also increase security: those address blocks are never routed on the public internet by any internet service provider. So, resources internal to our VPC cannot be “physically” reached from the outside of it unless the resource itself is defined as publicly reachable.

For example, we set our AWS::M2::Environment resource to PubliclyAccessible: true because we want the 3270 endpoint of CardDemo to be accessible for transactional use from the M2 application. For this purpose, the AWS::EC2::InternetGateway resource uses its NAT features to allow address translation between the RFC1918 private address and a public routable address that comes from the AWS Elastic IP address pool.

We don’t have to take specific care in the choice of the CIDR address block: RFC1918 allows the same IP address block to be reused in multiple locations of a given network as long as a proper NAT architecture is in place. That’s why we chose a block in the 10.0.0.0/8 range (the former Class A private range). This choice allows us to create multiple independent pairs of subnets for parallel test scenarios.
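
A dedicated VPC with one pair of subnets could be sketched as below. The VpcCidr parameter, its /16 default and the subnet names M2SubnetA and M2SubnetB are assumptions for this example. The subnet blocks are derived with the intrinsic function Cidr instead of being hard-coded.

```yaml
Parameters:
  VpcCidr:                       # hypothetical parameter
    Description: 'RFC1918 address block of the vpc'
    Type: String
    Default: '10.0.0.0/16'       # assumed block inside 10.0.0.0/8

Resources:
  M2Vpc:
    Type: AWS::EC2::VPC
    DeletionPolicy: Delete
    Properties:
      CidrBlock: !Ref VpcCidr
      EnableDnsSupport: true
      EnableDnsHostnames: true

  M2SubnetA:
    Type: AWS::EC2::Subnet
    DeletionPolicy: Delete
    Properties:
      VpcId: !Ref M2Vpc
      AvailabilityZone: !Select [0, !GetAZs '']
      # Cidr splits the VPC block into 2 blocks with 8 host bits each (/24)
      CidrBlock: !Select [0, !Cidr [!GetAtt M2Vpc.CidrBlock, 2, 8]]

  M2SubnetB:
    Type: AWS::EC2::Subnet
    DeletionPolicy: Delete
    Properties:
      VpcId: !Ref M2Vpc
      AvailabilityZone: !Select [1, !GetAZs '']
      CidrBlock: !Select [1, !Cidr [!GetAtt M2Vpc.CidrBlock, 2, 8]]
```

Since each stack owns its VPC, every parallel stack can use the same default block without any address conflict.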

Regarding the security group, the architecture is simple. We define only one such group, as shown in the following example.

M2VpcSecGroup:
    Type: AWS::EC2::SecurityGroup
    DeletionPolicy: Delete
    Properties:
      VpcId: !Ref M2Vpc
      GroupDescription: 'security group for vpc'
      SecurityGroupEgress:
        # IpProtocol -1 means all protocols; the port range is ignored in that case
        - IpProtocol: -1
          CidrIp: '0.0.0.0/0'
          FromPort: 0
          ToPort: 65535
          Description: 'Allow outbound access'
      SecurityGroupIngress:
        - IpProtocol: -1
          CidrIp: !GetAtt M2Vpc.CidrBlock
          FromPort: 0
          ToPort: 65535
          Description: 'Allow on-vpc inbound access'
        # tcp rather than -1, so that only the tn3270 port is open to the internet
        - IpProtocol: tcp
          CidrIp: '0.0.0.0/0'
          FromPort: !Ref Tn3270Port
          ToPort: !Ref Tn3270Port
          Description: 'Allow inbound tn3270 access'

This architecture has the following characteristics:

  • It allows any outbound traffic requested by VPC internal resources.
  • It allows all inbound traffic that originates from inside the VPC.
  • It allows incoming 3270 traffic from any address on the internet, using the [CidrIp ‘0.0.0.0/0’](https://serverfault.com/questions/1100250/what-is-the-difference-between-0-0-0-0-0-and-0-0-0-0-1) attribute, on a custom port defined by the Tn3270Port parameter.

You can adapt it as needed to more complex use cases.
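
One such adaptation, sketched here under stated assumptions, is to restrict tn3270 access to a known client range instead of the whole internet. The Tn3270ClientCidr parameter and the M2Tn3270Ingress resource are hypothetical, and the default value is a documentation-only address range that you would replace with your own. This standalone rule would take the place of the 0.0.0.0/0 rule in the security group above.

```yaml
Parameters:
  Tn3270ClientCidr:              # hypothetical parameter
    Description: 'address range allowed to reach the tn3270 endpoint'
    Type: String
    Default: '203.0.113.0/24'    # documentation range, replace with your own

Resources:
  M2Tn3270Ingress:               # hypothetical standalone ingress rule
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref M2VpcSecGroup
      IpProtocol: tcp
      FromPort: !Ref Tn3270Port
      ToPort: !Ref Tn3270Port
      CidrIp: !Ref Tn3270ClientCidr
      Description: 'Allow inbound tn3270 access from known clients only'
```

Exposing the range as a parameter keeps the Resources section unchanged when the set of authorized clients differs from one test campaign to the next.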