Anatomy of efficient CloudFormation templates for large-scale automated testing, AIOps, MLOps, etc.
This article details how we structure CloudFormation templates for improved parallelism, cost-efficiency, security, and performance.
1. Introduction
In my current ML/GenAI activities on automatic test code generation with LLMs for AWS Mainframe Modernization, we need to deploy (and delete after use) many application environments so that LLMs and other models can learn from test-execution feedback obtained on fresh, unbiased environments. So, we make heavy use of AWS CloudFormation (CFN). It allows us to fully automate our MLOps / AIOps workflows with repeatable quality while remaining cost-efficient.
In this article, we’ll detail some of our best practices to deliver on those objectives.
2. Rationale for CloudFormation and Infrastructure as Code
We use CloudFormation templates to restore the exact same initial state of AWS resources and application data each time we replay a test scenario. This is required to compare the results of successive iterations of our models. All the resources that the CloudFormation engine creates together from the template are grouped as a resource stack.
CFN brings all the advantages of Infrastructure-as-Code:
- quality by repeatability: once your CFN template is valid, you can be confident that each time you recreate your stack, CloudFormation delivers the same configuration for your tests as it did in previous iterations.
- scalability: thanks to the elasticity of the AWS cloud, you can create as many resource stacks as you need in parallel from the same template, run your various tests in parallel, and accelerate your project. The only limitation may come from the resource quotas assigned to your account, which AWS Support can increase as needed.
- automation: CFN provides a full set of APIs to create, delete, and manage a stack. With this “programmatic infrastructure”, you can automatically create and manage very large-scale architectures once you have written scripts that leverage those APIs.
- costs: the resources of a single template are created in parallel. They are also deleted in parallel right after the end of the test. Under the pay-as-you-use model of the AWS Cloud, where cost depends on duration, this workflow minimizes the resulting costs.
- speed: the CloudFormation engine is optimized to create and destroy all required resources in the right order as quickly as possible. It beats a human doing the same work every time, because the engine knows the order of creation and deletion that yields the shortest time.
Application test scenarios may process business-confidential data. So, their CFN templates used as initial conditions must include best practices to obtain optimal security posture for such a use case.
The following diagram shows the architecture of the sample CloudFormation template for a basic single-instance AWS Mainframe Modernization environment. You can download this CloudFormation template from here. We focus on YAML syntax because it is more concise and readable than JSON.
If you adapt this template for your specific use case, especially if you need to distribute it to customers or partners, make sure to follow the AWS best practices for CFN templates.
The following sections contain details about how the template is structured so that you can understand how the various parts relate to each other.
2. Best practices for CFN templates
In our use case, we must run several instances of the same CFN stack simultaneously to create identical initial conditions. Running in parallel like this accelerates our MLOps cycles. In doing so, our workflow leverages cloud elasticity (as many resources as you need at any given moment) while remaining optimally frugal (resources are used for the minimal amount of time and deleted as soon as they are no longer needed).
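As a rough illustration of this create, test, delete cycle, here is a minimal Python sketch. It assumes a boto3 CloudFormation client is passed in as `cfn`. The function names, the stack-name prefix, the `run_test` callback, and the `CAPABILITY_IAM` setting are hypothetical choices for this example, not part of our actual tooling.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_fresh_stack(cfn, template_body, stack_name, run_test):
    """Create one stack, run a test against it, then always delete the stack."""
    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        # Assumption: only needed if the template creates IAM resources.
        Capabilities=["CAPABILITY_IAM"],
    )
    try:
        # Block until the stack reaches CREATE_COMPLETE (raises on failure).
        cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
        return run_test(stack_name)
    finally:
        # Delete right after the test to keep pay-as-you-use costs minimal.
        cfn.delete_stack(StackName=stack_name)


def run_in_parallel(cfn, template_body, prefix, count, run_test):
    """Run `count` identical, isolated stacks concurrently; return test results."""
    names = [f"{prefix}-{i}" for i in range(count)]
    with ThreadPoolExecutor(max_workers=count) as pool:
        futures = [
            pool.submit(run_on_fresh_stack, cfn, template_body, name, run_test)
            for name in names
        ]
        return [future.result() for future in futures]
```

With a real account, `cfn` would typically be `boto3.client("cloudformation")`, and `run_test` would drive the test scenario against the stack it receives. The `finally` clause is the important design choice: the stack is deleted even when the test raises.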
a) Resource naming: meaningful resource names (that is, not those generated automatically by the system) make developers and sysops more productive as they navigate across resources for various activities. Many types of AWS resources that are part of a CloudFormation stack require names that are unique within their service (for example, no two Amazon RDS databases may have the same name). So, our reference template defines an SSM parameter to create a unique suffix (see the AWS::SSM::Parameter with logical name UniqueSuffix in the following example). This SSM parameter is based on a chunk of the CFN pseudo-parameter AWS::StackId (see Pseudo parameters reference in the CloudFormation User Guide). In the CFN template, we append this unique suffix to a meaningful resource name wherever required to ensure that it is unique. The suffix is common to all resources of the same stack, which lets a developer easily “assemble” the pieces of that stack. See the M2Name parameter in the following example.
UniqueSuffix:
  Type: AWS::SSM::Parameter
  DeletionPolicy: Delete
  Properties:
    Type: 'String'
    # First chunk of the UUID that ends the stack ID
    Value: !Select [0, !Split ['-', !Select [2, !Split ['/', !Ref AWS::StackId]]]]
M2Name:
  Type: AWS::SSM::Parameter
  DeletionPolicy: Delete
  Properties:
    Type: 'String'
    # <Label>-<UniqueSuffix>
    Value: !Join
      - '-'
      - - !Ref Label
        - !GetAtt UniqueSuffix.Value
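To make the `!Select` / `!Split` expression easier to follow, the short Python sketch below mirrors it on a sample stack ID. The account number, stack name, UUID, and the `carddemo` label are made-up values for illustration only.

```python
def unique_suffix(stack_id: str) -> str:
    """Mirror !Select [0, !Split ['-', !Select [2, !Split ['/', StackId]]]]."""
    # A stack ID looks like: arn:...:stack/<stack-name>/<uuid>
    stack_uuid = stack_id.split("/")[2]
    # Keep only the first chunk of the UUID as a short suffix.
    return stack_uuid.split("-")[0]


def m2_name(label: str, stack_id: str) -> str:
    """Mirror the !Join of the Label parameter and the unique suffix."""
    return "-".join([label, unique_suffix(stack_id)])


# Hypothetical stack ID, for illustration only.
example_id = (
    "arn:aws:cloudformation:us-east-1:123456789012:"
    "stack/m2-test/51af3dc0-da77-11e4-872e-1234567db123"
)
print(m2_name("carddemo", example_id))  # carddemo-51af3dc0
```

Because the suffix comes from the stack's own UUID, two stacks launched from the same template get different names without any coordination between them.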
b) Full-stack definitions: The complete template defines a full stack of resources from an M2 Application to the Aurora/RDS database that the application uses to store application data. It also includes low-level networking components (VPC, private subnets, internet gateway, associated security groups and so on) to create a fully autonomous and isolated application stack of resources.
We chose this architecture for the following reasons:
- Resilience: we could have defined a CFN stack that relied on the default VPC and subnets that AWS defines automatically for each region in which you can work. But the default VPC and subnets are very probably also used by several other components that run in parallel in your account. So, if we needed to update the configuration of those default networking components for our own purposes, we might break other running, productive use cases. Conversely, if others updated those default resources for another use case, they might break the configuration required for our AI activities.
- Confidentiality: with a dedicated VPC, the communication between the M2 application and the RDS database happens on a dedicated subnet. This prevents other non-related programs from snooping, which is possible if we use a shared subnet for this communication.
- Security: the security groups, Secrets Manager secrets for database credentials, and KMS keys are dedicated to this stack. So, security rules are not shared as they would be if they were embodied by the security groups associated with the default VPC. Therefore, we can ensure that the security groups are tightly restricted to the specific test scenario being run. Shared security groups could require broader permissions to allow access for additional protocols used by other parallel use cases. This requirement would decrease the security of the application and the corresponding business data being tested, which is potentially confidential. Similarly, secrets for database credentials and the KMS keys that encrypt them are not shared with any other use case within the account, which results in greater security for AppTest and those other use cases.
- Performance: all resources in the stack communicate with each other in complete isolation. This design choice prevents, for example, the network traffic between application and database being slowed down by network traffic, which could be heavy, between other resources in a shared VPC.
3. Template sections and highlights
This section describes the most important and closely related characteristics of the sample template. For more details regarding the structure of a CFN template and details about the content of the different sections, see Template anatomy in the CloudFormation User Guide.
3.1 Outputs section
The final Outputs section of a template is essential in our use case. M2 applications require the following two outputs: M2EnvironmentId and M2ApplicationId. These outputs give us the IDs of those two resources so that our tests can interact with them through the Mainframe Modernization APIs. You must define them using the intrinsic function GetAtt on the following return values:
- [EnvironmentId](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-m2-environment.html), returned by AWS::M2::Environment.
- [ApplicationId](https://docs.aws.amazon.com/en_us/AWSCloudFormation/latest/UserGuide/aws-resource-m2-application.html), returned by AWS::M2::Application.
The following example shows how to define these using GetAtt.
M2EnvironmentId:
  Description: 'm2 environment id'
  Value: !GetAtt M2Env.EnvironmentId
M2ApplicationId:
  Description: 'm2 application id'
  Value: !GetAtt M2App.ApplicationId
If your application requires an initial data import, like our OSS CardDemo sample application does, you must add a third output. This output allows your application to get the S3 location of the JSON file that contains all definitions and parameters (location, structure, etc.) of the initial data import. Our AI tests subsequently call the M2 dataset import API with this S3 location of the JSON file as a parameter.
M2ImportJson:
  Description: 's3 location of import json'
  Value: !Ref ImportJsonS3Location
This output simply and directly references the parameter ImportJsonS3Location, which contains the effective value for the S3 location.
ImportJsonS3Location:
  Description: 's3 location of import definitions'
  Default: 's3://aws-m2-math-artefacts/mf/card-demo/mf-carddemo-datasets-import.json'
  Type: String
For other advanced use cases, you could construct this output in a more sophisticated manner, such as depending on some dynamically created bucket or other mechanisms.
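On the consuming side, a test script has to turn the stack outputs into something convenient to use. The following Python sketch assumes the response shape returned by the boto3 `describe_stacks` call, where each output is a dictionary with `OutputKey` and `OutputValue`. The helper name and the sample values are hypothetical.

```python
def stack_outputs(describe_stacks_response: dict) -> dict:
    """Flatten the Outputs list of the first stack into a {key: value} dict."""
    stack = describe_stacks_response["Stacks"][0]
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stack.get("Outputs", [])
    }


# Hypothetical response excerpt, for illustration only.
sample_response = {
    "Stacks": [
        {
            "StackName": "m2-test-0",
            "Outputs": [
                {"OutputKey": "M2EnvironmentId", "OutputValue": "env-id-example"},
                {"OutputKey": "M2ApplicationId", "OutputValue": "app-id-example"},
                {"OutputKey": "M2ImportJson", "OutputValue": "s3://bucket/import.json"},
            ],
        }
    ]
}

outputs = stack_outputs(sample_response)
print(outputs["M2ApplicationId"])  # app-id-example
```

In a real run, the response would come from `cfn.describe_stacks(StackName=...)` on a boto3 CloudFormation client, and the three values would then feed the Mainframe Modernization API calls made by the tests.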
3.2 Parameters section
In the first section of the template, Parameters, we try to define all the values that users might want to change in the template to adapt it to their own contexts. The goal is to avoid changing any value in the Resources section when customization is needed. The Parameters section of the template makes it easy to locate where changes were made and to avoid inappropriate changes in the Resources section. To take this practice to its maximum, we could also have defined the Aurora database parameters (EngineVersion, MinCapacity, MaxCapacity) in this section.
Another advantage of the Parameters section is that customers can change the value of the parameters interactively in the CloudFormation console (or programmatically via scripts) when they are launching the stack manually for testing purposes.
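For the programmatic route, parameter overrides are passed to the CloudFormation API as a list of key/value dictionaries. The sketch below shows one way to build that list from a plain Python dictionary. The helper name is hypothetical and the override values are made up. Only the `ParameterKey` / `ParameterValue` structure comes from the CloudFormation `CreateStack` API.

```python
def to_cfn_parameters(overrides: dict) -> list:
    """Convert {name: value} overrides into the list shape CreateStack expects."""
    return [
        # CloudFormation parameter values are always passed as strings.
        {"ParameterKey": name, "ParameterValue": str(value)}
        for name, value in sorted(overrides.items())
    ]


# Hypothetical override values for parameters named in this article.
params = to_cfn_parameters({"Label": "carddemo", "Tn3270Port": 6000})
print(params)
```

The resulting list can be passed as the `Parameters` argument of a boto3 `create_stack` call. Any parameter left out keeps the default declared in the template.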
3.3 Application definition
We define the M2 application definition as an SSM parameter whose value is a long JSON string. This choice allows CloudFormation to display the definition in the Outputs section after all variable substitution is complete (see Outputs section). It also delegates to CloudFormation the task of parsing this value and replacing every variable in it with the live values of the parameters and CFN references that correspond to the specific instantiation of the stack.
M2AppDef:
  Type: AWS::SSM::Parameter
  DeletionPolicy: Delete
  Properties:
    Type: 'String'
    Value: !Sub |
      {
        "template-version": "2.0",
        "source-locations": [
          {
            "source-id": "s3-source",
            "source-type": "s3",
            "properties": {
              "s3-bucket": "${BucketName}",
              "s3-key-prefix": "${AppKey}"
            }
          }
        ],
        ...
        "xa-resources": [
          {
            "name": "XASQL",
            "secret-manager-arn": "${M2DbSecret}",
            "module": "${!s3-source}/xa/ESPGSQLXA64.so"
          }
The previous example shows two distinct use cases. ${BucketName} and ${AppKey} are replaced by the values of the corresponding parameters from the Parameters section. ${M2DbSecret} is replaced dynamically by the ARN of the secret that the M2DbSecret resource of the template creates. This ARN is the return value of a reference to the resource with the logical ID M2DbSecret.
Two highlights about variables that reference another resource from within a string substitution:
- They can reference an attribute of a CFN resource: in the following example, ${M2DbName.Value} resolves to one of the return values that the CFN intrinsic function GetAtt exposes for the resource type AWS::SSM::Parameter. In this case, it is the return value named Value.
"dataset-location": {
  "db-locations": [
    {
      "name": "${M2DbName.Value}",
      "secret-manager-arn": "${M2DbSecret}"
    }
  ]
}
- You can escape the ${varname} syntax used by CFN parsing when needed. This is the case for M2: the exclamation mark in ${!s3-source} is an escape directive in CFN string substitution. CloudFormation does not substitute the variable and emits the literal ${s3-source}, which M2 then uses in the application definition.
"cics-settings": {
  "binary-file-location": "${!s3-source}/${LoadlibKey}",
  "csd-file-location": "${!s3-source}/${RdefKey}",
  "system-initialization-table": "${SitKey}"
},
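The substitution and escaping rules described in this section can be mimicked in a few lines of Python. This is a simplified emulation written for illustration, not CloudFormation's implementation: it only handles flat `${Name}` and `${!Name}` tokens, and the value supplied for `LoadlibKey` is made up.

```python
import re

# Matches ${Name} and ${!Name}; group 1 captures the optional "!" escape.
_SUB_TOKEN = re.compile(r"\$\{(!?)([^}]+)\}")


def emulate_sub(text: str, values: dict) -> str:
    """Replace ${Name} with values[Name]; turn ${!Name} into a literal ${Name}."""

    def replace(match):
        escaped, name = match.groups()
        if escaped:
            # Escaped token: drop the "!" and leave it for the consumer (M2).
            return "${" + name + "}"
        return str(values[name])

    return _SUB_TOKEN.sub(replace, text)


line = '"binary-file-location": "${!s3-source}/${LoadlibKey}"'
print(emulate_sub(line, {"LoadlibKey": "loadlib"}))
# "binary-file-location": "${s3-source}/loadlib"
```

The output shows the two behaviours side by side: `${LoadlibKey}` is resolved at stack creation time, while `${s3-source}` survives as a literal for M2 to interpret later.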
3.4 Inter-resource dependencies
CloudFormation tries to minimize the time needed for the stack to reach CREATE_COMPLETE status. This status means that all resources created by the template are ready for use. Consequently, the CloudFormation engine launches all possible resource creations in parallel.
However, CloudFormation provides a mechanism, the DependsOn attribute, to create dependencies between resources and so orchestrate their creation sequence when they depend on each other. For example, the M2App resource can’t be created until the M2DbCluster resource is created, as shown in the following example:
M2App:
  Type: AWS::M2::Application
  DeletionPolicy: Delete
  DependsOn: M2DbCluster
  Properties:
    Name: !GetAtt M2Name.Value
    Description: !Join
      - ' '
      - - 'm2 application:'
        - !GetAtt M2Name.Value
    EngineType: !Ref EngineType
    Definition:
      Content: !GetAtt M2AppDef.Value
    Tags:
      'app-name': !GetAtt M2Name.Value
If a dependency like this isn’t explicit, CloudFormation might start creating the application resource before the database resource is ready, and the application creation would then fail. The DependsOn attribute guarantees that application creation starts only when the database is available.
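The effect of dependencies on creation order can be pictured with a topological sort. The Python sketch below uses the standard `graphlib` module (Python 3.9+) to group resources into successive "waves" that could be created in parallel. It is a simplified mental model, not a description of the CloudFormation engine's actual scheduling. Only the M2App to M2DbCluster edge comes from the template excerpt above; the other edges are assumptions made for the example.

```python
from graphlib import TopologicalSorter


def creation_waves(depends_on: dict) -> list:
    """Group resources into waves; each wave only needs the earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything creatable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as created
    return waves


# resource -> set of resources it must wait for (mostly illustrative edges)
dependencies = {
    "M2Vpc": set(),
    "M2DbSecret": set(),
    "M2DbCluster": {"M2Vpc", "M2DbSecret"},
    "M2Env": {"M2Vpc"},
    "M2App": {"M2DbCluster"},  # the explicit DependsOn from the template
}

for wave in creation_waves(dependencies):
    print(wave)
```

Under these assumed edges, the VPC and the secret can start together, the database cluster and the environment follow, and the application comes last.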
3.5 Application database
We create an empty, fresh Aurora serverless database here because the initial data load is implemented by importing data sets. However, you could instead restore a snapshot that you created previously using the AWS CLI command [aws rds create-db-snapshot](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/rds/create-db-snapshot.html) (or [aws rds create-db-cluster-snapshot](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/rds/create-db-cluster-snapshot.html)).
The CFN template must then include the ID of the snapshot to restore, supplied through the DbSnapshotIdentifier parameter.
You can also adapt this template to start a classical provisioned RDS instance / cluster if needed: see this guide to select the right one for your use case.
3.6 Network resources
The CFN template creates its own independent network resources: VPC, subnets, internet gateway, route table, and so on. As explained under Full-stack definitions above, doing so increases resiliency, security, confidentiality, and performance.
The VPC address block is chosen from among those ranges recommended for private subnets in RFC1918. Choices in those ranges also increase security: those address blocks are never routed on the public internet by any internet service provider. So, resources internal to our VPC cannot be “physically” reached from the outside of it unless the resource itself is defined as publicly reachable.
For example, we set our AWS::M2::Environment resource to PubliclyAccessible: true because we want the 3270 endpoint of CardDemo to be accessible for transactional use from the M2 application. For this purpose, the AWS::EC2::InternetGateway resource uses its NAT features to translate between the RFC1918 private address and a publicly routable address that comes from the AWS Elastic IP address pool.
We don’t need to take special care in choosing the CIDR address block: RFC1918 allows the same IP address block to be reused in multiple locations of a given network as long as a proper NAT architecture is in place. That’s why we chose a block in the large Class A range that starts at 10.0.0.0. This choice allows us to create multiple independent pairs of subnets for parallel test scenarios.
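To illustrate how a private block can be split into independent subnets, here is a small Python sketch based on the standard `ipaddress` module. The `10.0.0.0/16` VPC block and the `/24` subnet size are assumptions chosen for the example. The actual prefix lengths used by the template may differ.

```python
import ipaddress
from itertools import islice

# The three private ranges defined by RFC1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def carve_subnets(vpc_cidr: str, new_prefix: int, count: int) -> list:
    """Return the first `count` subnets of size /new_prefix inside the VPC block."""
    vpc = ipaddress.ip_network(vpc_cidr)
    if not any(vpc.subnet_of(block) for block in RFC1918_BLOCKS):
        raise ValueError(f"{vpc_cidr} is not inside an RFC1918 private range")
    return [str(net) for net in islice(vpc.subnets(new_prefix=new_prefix), count)]


# Assumed VPC block and subnet size, for illustration only.
print(carve_subnets("10.0.0.0/16", 24, 2))  # ['10.0.0.0/24', '10.0.1.0/24']
```

A /16 block holds 256 such /24 subnets, which leaves ample room for pairs of subnets per test scenario.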
Regarding the security group, the architecture is simple. We define only one such group, as shown in the following example.
M2VpcSecGroup:
  Type: AWS::EC2::SecurityGroup
  DeletionPolicy: Delete
  Properties:
    VpcId: !Ref M2Vpc
    GroupDescription: 'security group for vpc'
    SecurityGroupEgress:
      - IpProtocol: -1
        CidrIp: '0.0.0.0/0'
        FromPort: 0
        ToPort: 65535
        Description: 'Allow outbound access'
    SecurityGroupIngress:
      - IpProtocol: -1
        CidrIp: !GetAtt M2Vpc.CidrBlock
        FromPort: 0
        ToPort: 65535
        Description: 'Allow on-vpc inbound access'
      # tcp rather than -1: with protocol -1 the port range is ignored,
      # so the rule would admit all traffic from any address.
      - IpProtocol: tcp
        CidrIp: '0.0.0.0/0'
        FromPort: !Ref Tn3270Port
        ToPort: !Ref Tn3270Port
        Description: 'Allow inbound tn3270 access'
This architecture has the following characteristics:
- It allows any outbound traffic requested by VPC internal resources.
- It allows all on-VPC inbound internal traffic.
- It allows incoming 3270 traffic arriving from any address on the internet, using the [CidrIp '0.0.0.0/0'](https://serverfault.com/questions/1100250/what-is-the-difference-between-0-0-0-0-0-and-0-0-0-0-1) attribute, on a custom port defined by the Tn3270Port parameter.
You can adapt it as needed to more complex use cases.
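When adapting the rules, keep in mind that a security group rule with IpProtocol -1 covers all protocols and ports, whatever port range it declares. The Python sketch below is a tiny, hypothetical lint helper that flags ingress rules open to the whole internet on every port. The rule dictionaries mimic the template properties, and the port number 6000 and the CIDR of the VPC are made-up values.

```python
def rules_open_to_world(ingress_rules: list) -> list:
    """Describe every ingress rule that admits ALL traffic from any IPv4 address.

    With IpProtocol -1 the port range is ignored, so FromPort/ToPort
    do not narrow such a rule.
    """
    return [
        rule.get("Description", "<no description>")
        for rule in ingress_rules
        if str(rule["IpProtocol"]) == "-1" and rule.get("CidrIp") == "0.0.0.0/0"
    ]


# Hypothetical rule set: VPC-internal traffic plus one public TCP port.
ingress = [
    {"IpProtocol": "-1", "CidrIp": "10.0.0.0/16",
     "Description": "Allow on-vpc inbound access"},
    {"IpProtocol": "tcp", "CidrIp": "0.0.0.0/0", "FromPort": 6000,
     "ToPort": 6000, "Description": "Allow inbound tn3270 access"},
]
print(rules_open_to_world(ingress))  # []
```

An empty result means that no rule exposes every port to the internet. A public rule declared with protocol -1 would be reported by its description.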