Building a Secure, Scalable, and Well-Architected AWS Enterprise Landing Zone
A comprehensive guide to multi-account strategy, centralized logging, cost accountability, CI/CD, multi-region hybrid connectivity, and Infrastructure as Code — aligned to the AWS Well-Architected Framework.
Disclaimer: This guide presents a reference architecture based on AWS best practices and the Well-Architected Framework. Your actual implementation will vary depending on your organization's compliance requirements (e.g., HIPAA, PCI-DSS, FedRAMP, GDPR, SOC 2), industry regulations, existing infrastructure, risk tolerance, team size, and budget. Treat this as a starting point — not a prescriptive blueprint. Always validate architecture decisions with your security, compliance, and legal teams before deploying to production. AWS services, pricing, and features evolve frequently; verify current capabilities in the AWS documentation at the time of implementation.
Table of Contents
- Introduction
- Before You Begin — Preparation Checklist
- Architecture Overview
- Multi-Account Strategy & OU Structure
- AWS Well-Architected Framework Alignment
- Centralized Logging & Access Tracking
- Mandatory Tagging & Deployment Accountability
- Cost Allocation, Monitoring & Chargeback
- Services by Team
- CI/CD Pipeline Architecture
- Infrastructure as Code — Terraform Starter
- Multi-Region Architecture & Hybrid Connectivity
- Business Continuity & Disaster Recovery
- Alerting & Change Notifications
- Monitoring & Alerting Built Into Deployment
- Additional Best Practices & Considerations
- Phased Rollout Plan
- Conclusion
Introduction
Every enterprise AWS journey starts with the same question: How do we build a foundation that's secure, scalable, cost-transparent, and doesn't become unmanageable at scale?
The answer is a landing zone — a well-architected, multi-account AWS environment with built-in governance, security controls, centralized logging, cost allocation, and automated deployment pipelines. This guide covers everything you need to go from zero to a production-ready enterprise AWS environment, mapped to the six pillars of the AWS Well-Architected Framework.
We'll cover:
- Multi-account strategy with AWS Organizations and Control Tower
- Centralized logging for every API call, network flow, and resource change
- Mandatory tagging to track who deployed what, for which department, at what cost
- CI/CD pipelines for application and infrastructure deployments
- Infrastructure as Code with Terraform (and CloudFormation alternatives)
- Multi-region architecture with hybrid connectivity to on-premises via Direct Connect and VPN
- Alerting and change notifications via EventBridge, SNS, and AWS Chatbot (since renamed Amazon Q Developer in chat applications)
- Monitoring and alerting baked into every deployment — not bolted on afterward
Before You Begin — Preparation Checklist
Before deploying a single resource, get these decisions and prerequisites in place.
Organizational Decisions
| Decision | What to Define | Why It Matters |
|---|---|---|
| Account email strategy | Dedicated email distribution list per account (e.g., aws-security@company.com) | AWS requires a unique email per account; DLs ensure team access, not individual dependency |
| Naming conventions | Standard for accounts, OUs, resources, tags | Consistency prevents confusion at scale |
| Region strategy | Primary region + DR region + denied regions | Compliance, latency, and cost implications |
| IP address plan (CIDR) | Non-overlapping CIDR ranges across all VPCs | You will regret overlapping CIDRs; plan for 3–5 years of growth |
| Identity provider (IdP) | Okta, Microsoft Entra ID (formerly Azure AD), Ping, or AWS-native | Federated SSO is non-negotiable for enterprise |
| Compliance requirements | SOC 2, HIPAA, PCI-DSS, FedRAMP, GDPR | Determines log retention, encryption, and network controls |
| Cost center taxonomy | Department → Cost Center → Project mapping | Required for chargeback/showback reporting |
| Change management process | Who approves prod deployments? What's the rollback process? | Must be defined before CI/CD pipelines are built |
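One way to keep the IP address plan non-overlapping is to carve every VPC range out of a single supernet rather than assigning ranges by hand. The following is a minimal sketch, assuming a `10.0.0.0/8` supernet and a /16 per VPC; the supernet and the VPC names are hypothetical placeholders, not values from this guide.

```hcl
# Hypothetical supernet and VPC names, for illustration only.
variable "supernet" {
  description = "Private supernet reserved for all AWS VPCs"
  type        = string
  default     = "10.0.0.0/8"
}

locals {
  # Each VPC receives the /16 at its list index.
  # Append new VPCs at the end so existing allocations never shift.
  vpc_names = [
    "network-hub",
    "shared-services",
    "app-a-dev",
    "app-a-staging",
    "app-a-prod",
  ]

  # cidrsubnet(prefix, newbits, netnum): /8 + 8 new bits = /16 blocks
  vpc_cidrs = {
    for idx, name in local.vpc_names :
    name => cidrsubnet(var.supernet, 8, idx)
  }
}

output "vpc_cidrs" {
  description = "Non-overlapping /16 per VPC"
  value       = local.vpc_cidrs
}
```

With these inputs, `network-hub` maps to `10.0.0.0/16` and `app-a-prod` to `10.4.0.0/16`. Because every range is derived from the same function, two VPCs cannot overlap unless the list itself is reordered. VPC IPAM offers a managed alternative to this static approach.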
Technical Prerequisites
- [ ] Management account created with MFA on root, no workloads deployed
- [ ] AWS Organizations enabled with all features
- [ ] Two dedicated email addresses for Log Archive and Audit accounts (Control Tower requirement)
- [ ] Identity Provider configured and ready for SAML/OIDC federation
- [ ] IP address plan documented — recommended: use AWS VPC IPAM for automated allocation
- [ ] Terraform state backend — S3 bucket + DynamoDB table in a dedicated account
- [ ] Git repository initialized for IaC code (GitHub, GitLab, or CodeCommit)
- [ ] Cost allocation tags decided and documented (see Mandatory Tagging section)
- [ ] Incident response plan — at minimum, define escalation paths and communication channels
- [ ] AWS Support plan: Business or Enterprise Support for production workloads (full Trusted Advisor checks and 24/7 support; a designated TAM comes with Enterprise Support)
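The state backend item in the checklist above can be sketched as follows. This is an illustrative configuration only: the bucket name, lock table name, state key, and region are hypothetical placeholders, and backend blocks accept literal values only (no variables).

```hcl
terraform {
  required_version = ">= 1.5.0"

  backend "s3" {
    bucket         = "example-terraform-state"                   # hypothetical bucket name
    key            = "landing-zone/management/terraform.tfstate" # hypothetical state path
    region         = "us-east-1"                                 # assumed primary region
    dynamodb_table = "example-terraform-locks"                   # hypothetical lock table
    encrypt        = true                                        # server-side encryption for state
  }
}
```

Using one state key per environment directory keeps the blast radius of a corrupted or locked state small. Recent Terraform releases also offer S3-native lock files as an alternative to the DynamoDB table, so check the S3 backend documentation for the version you run.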
Common Mistakes to Avoid
⚠️ Don't deploy workloads in the management account. It should only run Organizations, Control Tower, and billing. No EC2, no Lambda, no applications.
⚠️ Don't skip the IP address plan. Overlapping CIDRs between VPCs are extremely painful to fix after workloads are running.
⚠️ Don't use IAM users for human access. Use IAM Identity Center (SSO) with your corporate IdP from day one. IAM users are for service accounts only — and even those should use IAM roles where possible.
⚠️ Don't leave CloudTrail as a per-account afterthought. Set up the org-wide trail in the management account first, logging to the Log Archive account.
Architecture Overview
The architecture follows a hub-and-spoke model with centralized security, networking, and logging.
                        ┌─────────────────────┐
                        │ IAM Identity Center │
                        │   (Corporate IdP)   │
                        └──────────┬──────────┘
                                   │
                        ┌──────────▼──────────┐
                        │ Management Account  │
                        │  (Organizations,    │
                        │   Control Tower,    │
                        │   Billing)          │
                        └──────────┬──────────┘
        ┌─────────────────┬────────┴────────┬─────────────────┐
        ▼                 ▼                 ▼                 ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Security OU   │ │ Infra OU      │ │ Workloads OU  │ │ DevTools OU   │
│               │ │               │ │               │ │               │
│ • Log Archive │ │ • Network Hub │ │ • Dev OU      │ │ • CI/CD       │
│ • Security    │ │ • Shared Svcs │ │ • Staging OU  │ │   Account     │
│   Tooling     │ │ • Backup      │ │ • Prod OU     │ │               │
│ • Audit       │ │               │ │               │ │               │
└───────────────┘ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘
                          │                 │                 │
                  ┌───────▼─────────────────▼─────────────────▼───────┐
                  │               Transit Gateway (Hub)               │
                  │          Network Hub Account (Infra OU)           │
                  └────────┬─────────────────────┬────────────────────┘
                           │                     │
                  ┌────────▼────────┐   ┌────────▼────────┐
                  │ Direct Connect  │   │ AWS Network     │
                  │ / Site-to-Site  │   │ Firewall        │
                  │ VPN             │   │ (Egress/E-W)    │
                  └─────────────────┘   └─────────────────┘
Key design principles:
- Blast radius isolation — each workload, environment, and function lives in its own AWS account
- Centralized governance — SCPs, tag policies, and Config rules enforced at the organization level
- Shared networking — Transit Gateway provides connectivity without VPC peering sprawl
- Immutable logging — all logs flow to a dedicated Log Archive account with S3 Object Lock
- CI/CD as a first-class citizen — dedicated DevTools account with cross-account deployment roles
- Multi-region resilience — Transit Gateway inter-region peering with DR region for business continuity
- Hybrid connectivity: Direct Connect (primary) reaches both regions through a Direct Connect Gateway; Site-to-Site VPN (backup) terminates on the Transit Gateway in each region
Multi-Account Strategy & OU Structure
Organizational Units (OUs)
Root
├── Security OU
│ ├── Log Archive — CloudTrail, Config, VPC Flow Logs (immutable, S3 Object Lock)
│ ├── Security Tooling — GuardDuty delegated admin, Security Hub, Inspector, Macie
│ └── Audit — Read-only cross-account access for auditors & compliance
│
├── Infrastructure OU
│ ├── Network Hub — Transit Gateway, Direct Connect, VPN, DNS, Network Firewall
│ ├── Shared Services — Managed AD, internal tools, golden AMI pipeline, IPAM
│ └── Backup — AWS Backup central vault, cross-account backup policies
│
├── Sandbox OU
│ └── Sandbox-{user} — Experimentation (aggressive SCPs, budget caps, auto-nuke)
│
├── Workloads OU
│ ├── Dev OU
│ │ ├── App-A-Dev
│ │ └── App-B-Dev
│ ├── Staging OU
│ │ ├── App-A-Staging
│ │ └── App-B-Staging
│ └── Prod OU
│ ├── App-A-Prod
│ └── App-B-Prod
│
├── DevTools OU
│ └── CI/CD — CodePipeline, CodeBuild, ECR, CodeArtifact
│
└── Suspended OU — Decommissioned accounts (deny-all SCP attached)
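Part of the OU tree above could be expressed in Terraform roughly as follows. This sketch assumes the organization already exists and that Terraform runs with management-account credentials; the resource labels are illustrative, and only the Workloads and Suspended branches are shown.

```hcl
# Assumes the organization already exists (for example, created during Control Tower setup).
data "aws_organizations_organization" "current" {}

locals {
  # An organization has exactly one root
  root_id = data.aws_organizations_organization.current.roots[0].id
}

resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = local.root_id
}

# Nested environment OUs under Workloads
resource "aws_organizations_organizational_unit" "workload_env" {
  for_each = toset(["Dev", "Staging", "Prod"])

  name      = each.key
  parent_id = aws_organizations_organizational_unit.workloads.id
}

# Parking place for decommissioned accounts
resource "aws_organizations_organizational_unit" "suspended" {
  name      = "Suspended"
  parent_id = local.root_id
}
```

If Control Tower governs the organization, OUs created this way typically still need to be registered with Control Tower before its controls apply to them.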
Key Service Control Policies (SCPs)
| SCP | Attached To | What It Does |
|---|---|---|
| Deny root user actions | Root OU | Blocks all actions by the root user in every member account (SCPs never apply to the management account) |
| Restrict regions | Root OU | Denies API calls outside approved regions (e.g., us-east-1, us-west-2) |
| Require IMDSv2 | Root OU | Blocks EC2 launches that don't enforce Instance Metadata Service v2 |
| Deny leaving organization | Root OU | Prevents any account from removing itself from the org |
| Deny S3 public access | Root OU | Denies changes to S3 Block Public Access settings (s3:PutAccountPublicAccessBlock, s3:PutBucketPublicAccessBlock) so an enabled block cannot be switched off |
| Deny untagged resources | Workloads OU, DevTools OU | Blocks resource creation without required tags |
| Deny expensive services | Sandbox OU | Blocks Redshift, EMR, SageMaker large instances, etc. |
| Deny VPC peering | Sandbox OU | Prevents sandbox accounts from connecting to other networks |
| Deny all | Suspended OU | Complete lockout — only billing access remains |
| Protect log archive | Security OU | Deny s3:DeleteObject, s3:PutBucketPolicy on log buckets |
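As an illustration of the "Restrict regions" row, the sketch below defines a region-restriction SCP and attaches it to a target. The exempted global services in `NotAction` are an illustrative, incomplete list, and `root_id` is a hypothetical input you would supply from your own organization.

```hcl
variable "approved_regions" {
  description = "Regions in which API calls are allowed"
  type        = list(string)
  default     = ["us-east-1", "us-west-2"]
}

variable "root_id" {
  description = "ID of the organization root (or an OU) to attach the policy to"
  type        = string
}

resource "aws_organizations_policy" "restrict_regions" {
  name = "restrict-regions"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyOutsideApprovedRegions"
      Effect = "Deny"
      # Global services must be exempted; this list is illustrative and incomplete.
      NotAction = ["iam:*", "organizations:*", "route53:*", "cloudfront:*", "support:*", "sts:*"]
      Resource  = "*"
      Condition = {
        StringNotEquals = {
          "aws:RequestedRegion" = var.approved_regions
        }
      }
    }]
  })
}

resource "aws_organizations_policy_attachment" "restrict_regions" {
  policy_id = aws_organizations_policy.restrict_regions.id
  target_id = var.root_id
}
```

Attaching a new SCP to a test OU first, and only then to the root, limits the damage if the exemption list turns out to be missing a service your teams rely on.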
AWS Well-Architected Framework Alignment
Every component of this landing zone maps to one or more of the six pillars of the AWS Well-Architected Framework.
Pillar 1: Operational Excellence
The ability to support development and run workloads effectively, gain insight into operations, and continuously improve processes and procedures.
| Best Practice | Implementation |
|---|---|
| Perform operations as code | All infrastructure managed via Terraform/CloudFormation; no manual console changes |
| Make frequent, small, reversible changes | CI/CD pipelines with blue/green and canary deployments |
| Refine operations procedures frequently | Runbooks in Systems Manager; post-incident reviews |
| Anticipate failure | GameDays, chaos engineering with AWS Fault Injection Service |
| Learn from all operational events | CloudTrail + CloudWatch Logs Insights for incident analysis |
Services: AWS Systems Manager, CloudFormation/Terraform, CloudWatch, AWS Health, Trusted Advisor
Pillar 2: Security
The ability to protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies.
| Best Practice | Implementation |
|---|---|
| Implement a strong identity foundation | IAM Identity Center with corporate IdP; least-privilege permission sets; no IAM users for humans |
| Enable traceability | Org-wide CloudTrail; VPC Flow Logs; DNS query logging; S3 access logs |
| Apply security at all layers | SCPs at org level; security groups at instance level; WAF at edge; Network Firewall at VPC level |
| Automate security best practices | Config Rules with auto-remediation; GuardDuty auto-response via EventBridge + Lambda |
| Protect data in transit and at rest | KMS encryption for EBS, S3, RDS; ACM for TLS; VPN/Direct Connect for hybrid |
| Prepare for security events | Security Hub aggregation; Detective for investigation; incident response runbooks in SSM |
Services: IAM Identity Center, GuardDuty, Security Hub, Inspector, Macie, KMS, WAF, Shield, Network Firewall, CloudTrail, AWS Config
Pillar 3: Reliability
The ability of a workload to perform its intended function correctly and consistently when it's expected to.
| Best Practice | Implementation |
|---|---|
| Automatically recover from failure | Auto Scaling groups; multi-AZ RDS/Aurora; Route 53 failover routing for multi-region DR |
| Test recovery procedures | AWS Backup with periodic restore testing; DR runbooks; scheduled DR failover drills |
| Scale horizontally | ECS/EKS with Fargate; ALB for load distribution |
| Manage change in automation | IaC-only changes; drift detection; approval gates in CI/CD |
| Monitor and alarm | CloudWatch alarms on key metrics; composite alarms; EventBridge rules for infrastructure state changes |
| Plan for disaster recovery | Tiered DR strategy (active-active for Tier 1, pilot light for Tier 3); Aurora Global Database; TGW inter-region peering |
| Use fault isolation boundaries | Multi-region architecture; multi-AZ within each region; separate blast radius per account |
Services: Auto Scaling, ELB, Route 53 (health checks + failover routing), AWS Backup (cross-region vaults), S3 cross-region replication, Aurora Global Database, Transit Gateway inter-region peering, Direct Connect + VPN redundancy
Pillar 4: Performance Efficiency
The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes.
| Best Practice | Implementation |
|---|---|
| Use serverless architectures | Lambda for event processing; Fargate for containers; Aurora Serverless for variable DB loads |
| Go global in minutes | CloudFront for content delivery; Route 53 latency-based routing |
| Experiment more often | Sandbox OU with budget caps for rapid experimentation |
| Use the right resource type | Compute Optimizer recommendations; rightsizing via Cost Explorer |
| Monitor performance | CloudWatch Container Insights for ECS/EKS; X-Ray for distributed tracing |
Services: Lambda, Fargate, CloudFront, Compute Optimizer, X-Ray, CloudWatch
Pillar 5: Cost Optimization
The ability to run systems to deliver business value at the lowest price point.
| Best Practice | Implementation |
|---|---|
| Implement cloud financial management | CUR + Athena + QuickSight for chargeback dashboards; dedicated FinOps team |
| Adopt a consumption model | Auto Scaling; Lambda pay-per-invocation; Fargate Spot |
| Measure overall efficiency | Cost-per-transaction metrics; cost allocation by tag (Department, CostCenter, Project) |
| Stop spending money on undifferentiated heavy lifting | Managed services (RDS over self-managed DB, EKS over self-managed K8s) |
| Analyze and attribute expenditure | Mandatory cost allocation tags; per-account budgets with anomaly detection |
Services: Cost Explorer, AWS Budgets, Cost Anomaly Detection, CUR, Savings Plans, Compute Optimizer, S3 Intelligent-Tiering
Pillar 6: Sustainability
The ability to continually improve sustainability impacts by reducing energy consumption and increasing efficiency across all components.
| Best Practice | Implementation |
|---|---|
| Understand your impact | AWS Customer Carbon Footprint Tool dashboard |
| Maximize utilization | Auto Scaling to avoid over-provisioned idle resources; Spot instances for batch |
| Use managed services | Shared infrastructure (Lambda, Fargate, Aurora Serverless) is more efficient than dedicated EC2 |
| Reduce downstream impact | S3 lifecycle policies to move cold data to Glacier; delete unused EBS snapshots |
Services: Customer Carbon Footprint Tool, Graviton (ARM) instances, S3 Intelligent-Tiering, Spot Instances
Centralized Logging & Access Tracking
All logs flow into the Log Archive account in the Security OU. This account has a protective SCP that denies deletion of log data.
Log Types and Sources
| Log Type | AWS Service | What It Captures | Destination |
|---|---|---|---|
| API Activity | CloudTrail (Org Trail) | Every API call — who created/modified/deleted any resource, from which IP, with which role | S3 (Log Archive) + CloudWatch Logs |
| Resource Configuration | AWS Config | Configuration timeline of every resource — before/after snapshots | S3 (Log Archive) + Config Aggregator |
| Network Traffic | VPC Flow Logs | Accepted/rejected flows — source/dest IP, port, bytes, action | S3 (Log Archive) + CloudWatch Logs |
| DNS Queries | Route 53 Resolver Query Logging | Every DNS query from VPCs — domain, source IP, response | S3 (Log Archive) |
| S3 Data Access | S3 Server Access Logging + CloudTrail Data Events | Who accessed which bucket/object, when, from where | S3 (Log Archive) |
| SSO/Login Activity | IAM Identity Center + CloudTrail | Who logged in, which account, which permission set, MFA status | CloudTrail → S3 |
| Load Balancer | ALB/NLB Access Logs | Client IP, latency, status codes, target group | S3 (Log Archive) |
| Firewall | AWS Network Firewall Logs | Allowed/denied traffic through stateful and stateless rules | S3 + CloudWatch |
| WAF | AWS WAF Logs | Web request inspection — blocked/allowed, rule matches | S3 / Kinesis Firehose |
| Database | RDS/Aurora Audit Logs | SQL queries, login attempts, schema changes | CloudWatch Logs → S3 |
| Container | ECS/EKS + Container Insights | Application stdout/stderr, K8s audit logs, resource metrics | CloudWatch Logs → S3 |
| Lambda | CloudWatch Logs (automatic) | Invocations, duration, errors, cold starts | CloudWatch Logs → S3 |
| Cost Events | Cost & Usage Report (CUR) | Hourly cost per resource, with tags | S3 (Billing account) |
Log Retention & Lifecycle
| Tier | Retention Period | Storage Class | Purpose |
|---|---|---|---|
| Hot | 0 – 90 days | S3 Standard | Active investigation, real-time queries |
| Warm | 90 – 365 days | S3 Glacier Instant Retrieval | Compliance queries, incident lookback |
| Cold | 1 – 7 years | S3 Glacier Deep Archive | Regulatory retention (HIPAA: 6 yr, SOX: 7 yr) |
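Immutability: S3 Object Lock (WORM) is enabled on all log buckets. In compliance mode, no user, including the account root user, can delete or overwrite a locked object version during the retention period. Governance mode is weaker: principals granted s3:BypassGovernanceRetention can override it.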
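A default retention rule is what turns Object Lock from "available" into "applied to every new log object". The sketch below assumes the bucket was created with Object Lock enabled; the variable names and the 7-year default are assumptions for illustration.

```hcl
variable "log_archive_bucket_id" {
  description = "Name of the log archive bucket (must have been created with Object Lock enabled)"
  type        = string
}

variable "retention_days" {
  description = "Default retention applied to every new object version"
  type        = number
  default     = 2555 # roughly 7 years; an assumed value, set per your compliance needs
}

resource "aws_s3_bucket_object_lock_configuration" "log_archive" {
  bucket = var.log_archive_bucket_id

  rule {
    default_retention {
      mode = "COMPLIANCE" # cannot be shortened or removed once applied to an object
      days = var.retention_days
    }
  }
}
```

Because compliance-mode retention cannot be shortened afterward, it is worth trying the configuration with a short retention period in a non-production bucket first.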
Log Analysis Stack
| Use Case | Tool | Description |
|---|---|---|
| Real-time queries | CloudWatch Logs Insights | Interactive queries on recent logs, typically returning in seconds |
| Ad-hoc investigation | Amazon Athena | SQL queries against S3 log partitions |
| Security correlation | Security Hub + Amazon Security Lake | Aggregated findings with OCSF normalization |
| Dashboards | CloudWatch Dashboards / Managed Grafana | Operational and security dashboards |
| Long-term SIEM | Splunk / Datadog / Elastic (optional) | For enterprises with existing SIEM investments |
Mandatory Tagging & Deployment Accountability
Tag Schema
Every resource deployed in this environment must carry the following tags. Resources without required tags are blocked at creation time by SCPs.
Required Tags
| Tag Key | Example Value | Purpose |
|---|---|---|
| Department | Engineering, Finance, Security | Which team/department owns this resource |
| CostCenter | CC-1234 | Financial cost center for chargeback |
| Owner | jsmith@company.com | Individual who provisioned/owns the resource |
| Manager | mjones@company.com | Manager of the owner, for escalation and approval audit |
| Environment | dev, staging, prod | Lifecycle stage |
| Project | ProjectAlpha | Which project this resource belongs to |
| DeployedBy | ci/codepipeline, jsmith-manual | How the resource was deployed |
Recommended Tags
| Tag Key | Example Value | Purpose |
|---|---|---|
| DeployPipelineId | pipeline-abc123 | Links resource to exact CI/CD pipeline execution |
| Application | WebApp, DataPipeline | Application name for resource grouping |
| DataClassification | Public, Internal, Confidential | Data sensitivity level |
| Compliance | HIPAA, SOC2, PCI | Applicable compliance framework |
| ExpirationDate | 2026-06-30 | Auto-cleanup for temporary resources |
Four Layers of Tag Enforcement
Layer 1: IaC Default Tags      → Terraform provider default_tags / CFN resource tags
   │                             (applied to every resource in the pipeline)
   ▼
Layer 2: CI/CD Validation      → Pipeline step validates all required tags before deploy
   │                             (fails the build if tags are missing)
   ▼
Layer 3: SCP Enforcement       → Organization SCP denies Create* APIs without tags
   │                             (catches manual console deployments)
   ▼
Layer 4: Config Rule Detection → AWS Config required-tags rule + auto-remediation
                                 (detects tag drift, notifies owner, quarantines if needed)
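Organization tag policies, mentioned among the governance controls, complement these layers by standardizing key capitalization and allowed values. A minimal sketch follows; the `CC-*` value pattern, the enforced resource types, and the `workloads_ou_id` input are assumptions for illustration.

```hcl
variable "workloads_ou_id" {
  description = "ID of the OU the tag policy should apply to"
  type        = string
}

resource "aws_organizations_policy" "cost_center_tag" {
  name = "cost-center-tag-policy"
  type = "TAG_POLICY"

  content = jsonencode({
    tags = {
      CostCenter = {
        # Enforce the exact capitalization of the key
        tag_key = { "@@assign" = "CostCenter" }
        # Allowed values; the pattern is an assumed convention
        tag_value = { "@@assign" = ["CC-*"] }
        # Resource types on which non-compliant values are rejected
        enforced_for = { "@@assign" = ["ec2:instance", "ec2:volume"] }
      }
    }
  })
}

resource "aws_organizations_policy_attachment" "cost_center_tag" {
  policy_id = aws_organizations_policy.cost_center_tag.id
  target_id = var.workloads_ou_id
}
```

A tag policy of this shape rejects a wrong value such as `cc1234`, but it does not block a resource that carries no `CostCenter` tag at all. That gap is why the SCP and Config layers remain necessary.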
Auto-Tagging for Manual Deployments
Even with all the above, someone will eventually create a resource via the console. To catch this:
- EventBridge rule triggers on CloudTrail `Create*` / `RunInstances` / `CreateDBInstance` events
- Lambda function reads the CloudTrail event and auto-tags the resource with:
  - `CreatedBy` = IAM principal ARN from the event
  - `CreatedAt` = event timestamp
  - `CreatedVia` = `console` / `cli` / `sdk` / `terraform` (derived from user agent)
- If required tags are still missing after 48 hours → SNS notification to team lead + Config non-compliant finding
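The EventBridge side of this flow might look like the sketch below. It covers only two of the event names from the list, and the Lambda function itself is assumed to exist already; `auto_tagger_lambda_arn` and `auto_tagger_lambda_name` are hypothetical inputs.

```hcl
variable "auto_tagger_lambda_arn" {
  description = "ARN of the existing auto-tagger Lambda function"
  type        = string
}

variable "auto_tagger_lambda_name" {
  description = "Name of the existing auto-tagger Lambda function"
  type        = string
}

resource "aws_cloudwatch_event_rule" "resource_created" {
  name        = "auto-tag-on-create"
  description = "Match CloudTrail create events for the auto-tagger"

  event_pattern = jsonencode({
    source        = ["aws.ec2", "aws.rds"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["ec2.amazonaws.com", "rds.amazonaws.com"]
      eventName   = ["RunInstances", "CreateDBInstance"]
    }
  })
}

resource "aws_cloudwatch_event_target" "auto_tagger" {
  rule = aws_cloudwatch_event_rule.resource_created.name
  arn  = var.auto_tagger_lambda_arn
}

# Allow EventBridge to invoke the function for this rule only
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = var.auto_tagger_lambda_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.resource_created.arn
}
```

EventBridge rules are regional and per account, so a rule like this would need to be deployed to each workload account and region, or the events forwarded to a central event bus.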
Full Deployment Audit Trail
For every resource in your environment, you can answer: Who deployed it, when, how, for which project, under which cost center, approved by whom?
Git Commit (author + SHA + PR reviewer)
  → Pipeline Trigger (pipeline ID + source branch)
    → Approval Gate (who approved for staging/prod)
      → Deploy (tags: DeployedBy, DeployPipelineId, CommitSHA)
        → CloudTrail (immutable API-level audit log)
          → AWS Config (configuration timeline with tags)
Cost Allocation, Monitoring & Chargeback
Activating Cost Allocation Tags
In the Billing console of the management account, activate these tags as cost allocation tags:
- Department
- CostCenter
- Project
- Owner
- Environment
Note: Tags only appear in billing data after activation. Historical data before activation is not retroactively tagged. Activate on day one.
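Activation can also be managed as code instead of through the console. This is a sketch, assuming Terraform runs against the management account and that each tag key has already been applied to at least one resource:

```hcl
# Run with management (payer) account credentials.
resource "aws_ce_cost_allocation_tag" "required" {
  for_each = toset(["Department", "CostCenter", "Project", "Owner", "Environment"])

  tag_key = each.key
  status  = "Active"
}
```

A tag key generally has to show up in billing data before it can be activated, so the first apply may fail for keys that no resource carries yet. Tagging a single resource and retrying after the key appears in the Billing console usually resolves this.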
Cost & Usage Report (CUR)
| Setting | Configuration |
|---|---|
| Report name | enterprise-cur |
| Time granularity | Hourly |
| Format | Apache Parquet |
| Compression | Parquet (columnar, efficient for Athena) |
| S3 bucket | Dedicated bucket in management or billing account |
| Integration | Athena, QuickSight, Redshift |
| Resource-level data | Enabled (includes individual resource IDs) |
| Tag columns | All activated cost allocation tags included |
Budget Alerts
| Budget Type | Scope | Thresholds | Action |
|---|---|---|---|
| Per-account monthly | Each linked account | 50%, 80%, 100% of budget | SNS notification to account owner + finance |
| Per-cost-center | Filter by CostCenter tag | 80%, 100% | SNS to cost center owner |
| Per-project | Filter by Project tag | 80%, 100% | SNS to project lead |
| Anomaly detection | Per linked account | Auto-detected anomalies | SNS + optional Lambda to stop non-prod instances |
Chargeback Pipeline
Tagged Resources → CUR to S3 (hourly) → Athena (GROUP BY CostCenter, Department)
→ QuickSight Dashboard (monthly chargeback by team)
→ Automated PDF reports emailed to cost center owners
Example Athena Query — Cost by Department
SELECT
  line_item_product_code          AS service,
  resource_tags_user_department   AS department,
  resource_tags_user_cost_center  AS cost_center,
  SUM(line_item_unblended_cost)   AS total_cost
FROM cur_database.cur_table
WHERE month = '4'
  AND year = '2026'
GROUP BY 1, 2, 3
ORDER BY total_cost DESC
LIMIT 50;
Services by Team
Infrastructure Team
| Category | Services |
|---|---|
| Compute | EC2, ECS, EKS, Lambda, Auto Scaling Groups |
| Networking | VPC, Transit Gateway, Route 53, CloudFront, ALB/NLB, VPC IPAM |
| Storage | S3, EBS, EFS, FSx for Lustre / Windows |
| Database | RDS, Aurora, DynamoDB, ElastiCache, MemoryDB |
| Hybrid | Direct Connect, Site-to-Site VPN, AWS Outposts |
| Operations | Systems Manager, Patch Manager, AWS Backup, AWS Health |
| Monitoring | CloudWatch, X-Ray, Managed Grafana, Managed Prometheus |
| IaC | Terraform, CloudFormation, Service Catalog, CDK |
Developer / DevOps Team
| Category | Services |
|---|---|
| Source Control | GitHub / GitLab integration (CodeCommit's availability to new customers has changed over time; confirm its current status before adopting it) |
| CI/CD | CodePipeline + CodeBuild + CodeDeploy |
| Container Registry | Amazon ECR |
| Package Management | CodeArtifact (npm, Maven, pip) |
| IDE | Cloud9, VS Code with Amazon Q Developer |
| Security Scanning | CodeGuru Reviewer, Inspector (container images), Snyk integration |
| Testing | CodeBuild + testing frameworks, AWS Device Farm |
| Infrastructure Pipeline | Terraform Cloud / Atlantis / CodePipeline for IaC |
Security Team
| Category | Services |
|---|---|
| Identity & Access | IAM Identity Center, AWS Organizations SCPs, Permission Boundaries |
| Threat Detection | GuardDuty, Security Hub, Amazon Detective, Macie |
| Network Protection | AWS WAF, Shield Advanced, Network Firewall |
| Encryption | KMS (multi-region keys), ACM, CloudHSM |
| Compliance | AWS Config Rules, Audit Manager, Security Lake |
| Incident Response | EventBridge → Step Functions → Lambda automation |
CI/CD Pipeline Architecture
Application CI/CD
GitHub (webhook)
  → CodePipeline
      → Source Stage: pull code + resolve dependencies
      → Build Stage: CodeBuild
          • Docker build
          • Unit tests + integration tests
          • SAST scanning (CodeGuru Reviewer)
          • Container image scan (Inspector)
      → Artifact Stage: push to ECR / CodeArtifact
      → Deploy Dev: auto-deploy to ECS/EKS dev (blue/green)
      → Manual Approval: required for staging and prod
      → Deploy Staging: deploy + smoke tests
      → Manual Approval: prod gate
      → Deploy Prod: canary or blue/green via CodeDeploy
Infrastructure CI/CD
Git push (Terraform code)
  → CodePipeline
      → Source Stage: pull IaC repo
      → Plan Stage: CodeBuild runs `terraform plan`
          • Plan output posted as PR comment or artifact
          • Tag validation: check all resources have required tags
          • Cost estimation: Infracost or tfcost
      → Manual Approval: review plan output
      → Apply Stage: CodeBuild runs `terraform apply`
      → Drift Detection: scheduled `terraform plan` (no apply) to detect drift
Cross-Account Deployment Pattern
The CI/CD account (DevTools OU) assumes roles in target workload accounts:
CI/CD Account (DevTools OU)
│
├── AssumeRole → Dev Account (CodePipelineDeployRole)
├── AssumeRole → Staging Account (CodePipelineDeployRole)
└── AssumeRole → Prod Account (CodePipelineDeployRole)
Each CodePipelineDeployRole has:
- Least-privilege permissions scoped to the specific services being deployed
- Trust policy limited to the CI/CD account
- External ID for additional security
- CloudTrail logging of every `AssumeRole` call
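A deploy role with these properties could be sketched as below, deployed into each target workload account. The variable names and the ECS-only permission set are illustrative assumptions. The sketch also assumes the pipeline's build step calls `sts:AssumeRole` itself, so that it can supply the external ID.

```hcl
variable "cicd_account_id" {
  description = "Account ID of the CI/CD account in the DevTools OU"
  type        = string
}

variable "external_id" {
  description = "Shared external ID the pipeline must present when assuming the role"
  type        = string
  sensitive   = true
}

resource "aws_iam_role" "codepipeline_deploy" {
  name = "CodePipelineDeployRole"

  # Trust only the CI/CD account, and only when the external ID matches
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::${var.cicd_account_id}:root" }
      Condition = {
        StringEquals = { "sts:ExternalId" = var.external_id }
      }
    }]
  })
}

# Example of least-privilege scoping: ECS service deployments only (an assumed workload type)
resource "aws_iam_role_policy" "deploy_ecs" {
  name = "deploy-ecs-only"
  role = aws_iam_role.codepipeline_deploy.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["ecs:DescribeServices", "ecs:RegisterTaskDefinition", "ecs:UpdateService"]
      Resource = "*"
    }]
  })
}
```

Narrowing the trust principal from the account root to the specific pipeline or build role ARN in the CI/CD account tightens this further, once those role names are stable.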
Infrastructure as Code — Terraform Starter
Recommended Directory Structure
terraform-landing-zone/
├── modules/
│ ├── organization/ # AWS Organizations, OUs, SCPs, Tag Policies
│ │ ├── main.tf
│ │ ├── ous.tf
│ │ ├── scps.tf
│ │ ├── tag-policies.tf
│ │ └── variables.tf
│ ├── networking/ # Transit Gateway, VPCs, Subnets, IPAM
│ │ ├── main.tf
│ │ ├── transit-gw.tf
│ │ ├── vpc.tf
│ │ ├── network-firewall.tf
│ │ └── variables.tf
│ ├── security/ # GuardDuty, Security Hub, Inspector, Config
│ │ ├── guardduty.tf
│ │ ├── security-hub.tf
│ │ ├── config.tf
│ │ └── inspector.tf
│ ├── identity/ # IAM Identity Center, Permission Sets
│ │ ├── sso.tf
│ │ └── permission-sets.tf
│ ├── logging/ # CloudTrail org trail, S3 log archive, VPC Flow Logs
│ │ ├── cloudtrail.tf
│ │ ├── s3-log-archive.tf
│ │ ├── vpc-flow-logs.tf
│ │ └── config-recorder.tf
│ ├── governance/ # Tag policies, SCPs, Config Rules, auto-tagger Lambda
│ │ ├── tag-policy.tf
│ │ ├── scp-require-tags.tf
│ │ ├── config-rules.tf
│ │ └── auto-tagger.tf
│ ├── monitoring/ # CloudWatch alarms, dashboards, SNS topics
│ │ ├── alarms.tf
│ │ ├── dashboards.tf
│ │ └── sns-topics.tf
│ ├── cost/ # Budgets, CUR, anomaly detection
│ │ ├── budgets.tf
│ │ ├── cur.tf
│ │ └── anomaly-detection.tf
│ └── ...
│
├── environments/
│ ├── management/ # Management account bootstrap
│ ├── security/ # Security tooling account
│ ├── network/ # Network hub account
│ ├── shared-services/ # AD, internal tools
│ ├── cicd/ # DevTools account
│ ├── dev/ # Workload dev
│ ├── staging/ # Workload staging
│ └── prod/ # Workload prod
│
├── aft-config/ # Account Factory for Terraform
│ ├── account-request/ # New account definitions
│ ├── account-customizations/ # Per-account Terraform
│ └── global-customizations/ # Applied to all new accounts
│
└── pipelines/
├── buildspec-plan.yml # CodeBuild: terraform plan
└── buildspec-apply.yml # CodeBuild: terraform apply
Key Terraform: Provider Default Tags
# environments/{env}/main.tf
# Every resource in this environment automatically inherits these tags
provider "aws" {
  region = var.region

  default_tags {
    tags = {
      Environment      = var.environment           # "dev", "staging", "prod"
      Department       = var.department            # "Engineering"
      CostCenter       = var.cost_center           # "CC-1234"
      Owner            = var.deployer_email        # "jsmith@company.com"
      Manager          = var.manager_email         # "mjones@company.com"
      Project          = var.project_name          # "ProjectAlpha"
      DeployedBy       = "ci/terraform"
      DeployPipelineId = var.pipeline_execution_id
      ManagedBy        = "terraform"
    }
  }
}
Key Terraform: Organization & SCPs
# modules/organization/main.tf
resource "aws_organizations_organization" "org" {
  aws_service_access_principals = [
    "controltower.amazonaws.com",
    "sso.amazonaws.com",
    "config-multiaccountsetup.amazonaws.com",
    "guardduty.amazonaws.com",
    "securityhub.amazonaws.com",
    "cloudtrail.amazonaws.com",
    "tagpolicies.tag.amazonaws.com",
    "backup.amazonaws.com",
  ]

  feature_set          = "ALL"
  enabled_policy_types = ["SERVICE_CONTROL_POLICY", "TAG_POLICY"]
}

# SCP: Deny resource creation without required tags
resource "aws_organizations_policy" "require_tags" {
  name = "require-mandatory-tags"

  content = jsonencode({
    Version = "2012-10-17"
    # One Deny statement per tag. Keys listed together under a single
    # "Null" operator are ANDed, which would deny only requests that are
    # missing ALL of the tags rather than ANY of them.
    Statement = [
      for tag in ["Department", "CostCenter", "Owner", "Manager", "Environment"] : {
        Sid      = "DenyEC2Without${tag}"
        Effect   = "Deny"
        Action   = ["ec2:RunInstances"]
        Resource = ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*"]
        Condition = {
          "Null" = { "aws:RequestTag/${tag}" = "true" }
        }
      }
    ]
  })
}
Key Terraform: CloudTrail Org Trail with Immutable S3
# modules/logging/cloudtrail.tf
resource "aws_cloudtrail" "org_trail" {
  name                       = "enterprise-org-trail"
  s3_bucket_name             = aws_s3_bucket.log_archive.id
  is_organization_trail      = true
  is_multi_region_trail      = true
  enable_log_file_validation = true
  kms_key_id                 = aws_kms_key.log_encryption.arn
  cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.cloudtrail.arn}:*"
  cloud_watch_logs_role_arn  = aws_iam_role.cloudtrail_cw.arn

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    # Data events for objects in all S3 buckets
    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3"]
    }
  }
}

# Immutable log bucket
resource "aws_s3_bucket" "log_archive" {
  bucket              = "enterprise-log-archive-${data.aws_caller_identity.current.account_id}"
  object_lock_enabled = true
}

resource "aws_s3_bucket_lifecycle_configuration" "log_lifecycle" {
  bucket = aws_s3_bucket.log_archive.id

  rule {
    id     = "log-tiering"
    status = "Enabled"

    filter {} # apply the rule to every object in the bucket

    # HCL does not allow ";" separators, so each argument sits on its own line
    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
Key Terraform: AWS Config — Tag Compliance
# modules/governance/config-rules.tf
resource "aws_config_config_rule" "required_tags" {
  name = "required-tags-check"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  input_parameters = jsonencode({
    tag1Key = "Department"
    tag2Key = "CostCenter"
    tag3Key = "Owner"
    tag4Key = "Manager"
    tag5Key = "Environment"
    tag6Key = "DeployedBy"
  })

  scope {
    compliance_resource_types = [
      "AWS::EC2::Instance",
      "AWS::RDS::DBInstance",
      "AWS::S3::Bucket",
      "AWS::Lambda::Function",
      "AWS::ElasticLoadBalancingV2::LoadBalancer",
    ]
  }
}
Key Terraform: Budget Alerts
# modules/cost/budgets.tf
resource "aws_budgets_budget" "account_monthly" {
  name         = "account-monthly-${var.account_name}"
  budget_type  = "COST"
  limit_amount = var.monthly_budget_limit
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email, var.finance_email]
  }
}

resource "aws_ce_anomaly_monitor" "account" {
  name              = "account-anomaly-${var.account_name}"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}
Multi-Region Architecture & Hybrid Connectivity
This section covers the multi-region network topology with on-premises connectivity using Direct Connect (primary) and Site-to-Site VPN (backup), all routed through Transit Gateway.
Network Topology
```text
                   On-Premises Data Center(s)
          ┌──────────────────────────────────────────┐
          │            Corporate Network             │
          │  ┌─────────────┐    ┌─────────────────┐  │
          │  │  Customer   │    │    Customer     │  │
          │  │ Router (DX) │    │  Router (VPN)   │  │
          │  └──────┬──────┘    └───────┬─────────┘  │
          └─────────┼───────────────────┼────────────┘
                    │ Primary           │ Backup
          ┌─────────▼─────────┐  ┌──────▼──────────┐
          │   AWS Direct      │  │  AWS Site-to-   │
          │   Connect         │  │  Site VPN       │
          │ (1 or 10 Gbps)    │  │ (IPsec, ECMP)   │
          └─────────┬─────────┘  └──────┬──────────┘
                    │                   │
          ┌─────────▼───────────────────▼──────────┐
          │         Direct Connect Gateway         │
          │     (Global, not region-specific)      │
          └────────┬────────────────────┬──────────┘
                   │                    │
┌──────────────────▼──────┐   ┌─────────▼───────────────────┐
│  PRIMARY REGION         │   │  DR REGION                  │
│  (e.g., us-east-1)      │   │  (e.g., us-west-2)          │
│                         │   │                             │
│  ┌───────────────────┐  │   │  ┌───────────────────────┐  │
│  │  Transit Gateway  │◄─┼───┼─►│  Transit Gateway      │  │
│  │  (Primary)        │  │   │  │  (DR)                 │  │
│  └──┬──┬──┬──┬───────┘  │   │  └──┬──┬──┬──────────────┘  │
│     │  │  │  │          │   │     │  │  │                 │
│  VPCs:                  │   │  VPCs:                      │
│  Prod  Dev  CI/CD  NW   │   │  Prod  Shared  NW           │
│                Firewall │   │                Firewall     │
│                         │   │                             │
└────────────┬────────────┘   └──────────────┬──────────────┘
             │                               │
             └── TGW Inter-Region Peering ───┘
```
Note: the Site-to-Site VPN attaches directly to the Transit Gateway rather than to the Direct Connect Gateway; the diagram groups the two on-premises paths for readability.
Direct Connect — Primary Path
| Setting | Recommendation |
|---|---|
| Connection type | Dedicated connection (1 Gbps or 10 Gbps) for production; Hosted connection for smaller bandwidth |
| Redundancy | Two connections from different Direct Connect locations (e.g., one from EqDC2, one from CoreSite) for high availability |
| Direct Connect Gateway | Attach the DX Gateway to Transit Gateways in both primary and DR regions — a single DX connection reaches both regions |
| Virtual Interface | Transit Virtual Interface (Transit VIF) → Direct Connect Gateway → Transit Gateway |
| BGP | Private ASN on-prem; advertise on-prem routes; receive AWS VPC routes via BGP propagation |
| Encryption | MACsec (Layer 2 encryption) on dedicated connections of 10 Gbps and faster at supported locations; alternatively, run an IPsec VPN over the DX connection for in-transit encryption |
| Monitoring | CloudWatch metrics: ConnectionState, ConnectionBpsEgress, ConnectionBpsIngress — alarm on state change |
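
The monitoring recommendation in the last row can be expressed as a CloudWatch alarm. This is a minimal sketch rather than a definitive implementation: `var.dx_connection_id` is a hypothetical input holding the ID of an existing connection, the period and evaluation counts are illustrative, and the alarm reuses the `infra_alerts_critical` SNS topic defined in the alerting module of this guide.

```hcl
# Sketch: alarm when a Direct Connect connection leaves the "up" state.
# var.dx_connection_id is a hypothetical input (an existing dxcon-... ID).
resource "aws_cloudwatch_metric_alarm" "dx_connection_down" {
  alarm_name          = "dx-connection-down"
  alarm_description   = "Direct Connect connection is down"
  namespace           = "AWS/DX"
  metric_name         = "ConnectionState" # 1 = up, 0 = down
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 3
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  treat_missing_data  = "breaching" # no data from the link is treated as an outage
  alarm_actions       = [aws_sns_topic.infra_alerts_critical.arn]

  dimensions = {
    ConnectionId = var.dx_connection_id
  }
}
```

With two redundant connections, one alarm per connection (for example via `for_each` over a set of connection IDs) keeps a single-link failure visible even when traffic fails over cleanly.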
Site-to-Site VPN — Backup Path (Optional)
| Setting | Recommendation |
|---|---|
| Purpose | Failover path if Direct Connect goes down; also useful for initial setup while DX is being provisioned (DX can take weeks) |
| Attachment | VPN attached to the same Transit Gateway as the DX |
| ECMP | Enable ECMP on Transit Gateway for multiple VPN tunnels — increases aggregate bandwidth (each tunnel = ~1.25 Gbps) |
| Routing | BGP with lower priority (longer AS path or lower local preference) so traffic prefers DX when available |
| Encryption | IPsec — AES-256, SHA-256, DH Group 20+ |
| Accelerated VPN | Enable AWS Global Accelerator for VPN to reduce latency and jitter over public internet |
| Monitoring | CloudWatch metrics: TunnelState, TunnelDataIn, TunnelDataOut — alarm when tunnels go down |
Transit Gateway — Regional Hub
| Setting | Recommendation |
|---|---|
| Route tables | Segmented route tables: one for prod VPCs, one for non-prod, one for shared services — prevents dev from reaching prod directly |
| Inter-region peering | TGW peering between primary (us-east-1) and DR (us-west-2) regions — encrypted, runs over AWS backbone (not public internet) |
| Route propagation | On-prem routes propagate from DX/VPN attachment to all TGW route tables; VPC routes propagate to the on-prem route table |
| Blackhole routes | Add blackhole routes for denied traffic (e.g., sandbox OU CIDRs should not reach on-prem) |
| Network Firewall | Inspection VPC in the Network Hub account — all east-west and egress traffic routed through AWS Network Firewall |
| Flow Logs | TGW Flow Logs enabled → S3 (Log Archive account) for traffic analysis between all attachments |
| Sharing | Share the TGW via AWS RAM to all workload accounts in the organization |
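
The segmentation and blackhole recommendations above can be sketched in Terraform as follows. The example assumes the `prod` and `shared` route tables from the Transit Gateway module in this guide; `var.prod_vpc_attachment_id` and `var.sandbox_cidr` are hypothetical inputs introduced only for illustration.

```hcl
# Sketch: place a prod VPC attachment in the prod route table.
# var.prod_vpc_attachment_id is a hypothetical input.
resource "aws_ec2_transit_gateway_route_table_association" "prod_vpc" {
  transit_gateway_attachment_id  = var.prod_vpc_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# Advertise the prod VPC's CIDR into the shared-services route table so
# shared services can route return traffic to it.
resource "aws_ec2_transit_gateway_route_table_propagation" "prod_to_shared" {
  transit_gateway_attachment_id  = var.prod_vpc_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared.id
}

# Blackhole route: anything in the prod route table destined for the
# sandbox CIDR (hypothetical var.sandbox_cidr) is dropped at the TGW.
resource "aws_ec2_transit_gateway_route" "deny_sandbox" {
  destination_cidr_block         = var.sandbox_cidr
  blackhole                      = true
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
```

Because the Transit Gateway in this guide disables default association and propagation, every attachment needs an explicit association like the one above before it can route traffic.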
DNS Resolution (Hybrid)
| Component | Configuration |
|---|---|
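| Route 53 Private Hosted Zones | One per domain (e.g., internal.company.com), associated with VPCs in workload accounts through cross-account VPC association authorization, or distributed with Route 53 Profiles shared via RAM |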
| Route 53 Resolver Inbound Endpoints | In the Network Hub VPC — allows on-prem DNS servers to resolve AWS private domains |
| Route 53 Resolver Outbound Endpoints | In the Network Hub VPC — allows AWS resources to resolve on-prem DNS domains |
| Resolver Rules | Forward rules for on-prem domains (e.g., *.corp.company.com → on-prem DNS servers) shared via RAM |
| Query Logging | All DNS queries logged to S3 (Log Archive) and CloudWatch Logs |
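
A possible Terraform shape for the outbound endpoint and forwarding rule described above is shown below. Treat it as a sketch under stated assumptions: every `var.*` name, the resource names, and the `corp.company.com` domain are illustrative and not part of the original modules.

```hcl
# Sketch: outbound Resolver endpoint in the Network Hub VPC.
# Subnet and security group variables are hypothetical inputs.
resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "network-hub-outbound"
  direction          = "OUTBOUND"
  security_group_ids = [var.resolver_sg_id]

  # Resolver endpoints need at least two IP addresses; use separate AZs
  ip_address {
    subnet_id = var.hub_subnet_a_id
  }

  ip_address {
    subnet_id = var.hub_subnet_b_id
  }
}

# Forward queries for the on-prem domain to the on-prem DNS servers
resource "aws_route53_resolver_rule" "corp" {
  name                 = "forward-corp-domain"
  domain_name          = "corp.company.com"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip = var.onprem_dns_ip_primary
  }

  target_ip {
    ip = var.onprem_dns_ip_secondary
  }
}

# Share the rule with the organization through RAM
resource "aws_ram_resource_share" "resolver_rules" {
  name                      = "resolver-rules-org-share"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "corp_rule" {
  resource_arn       = aws_route53_resolver_rule.corp.arn
  resource_share_arn = aws_ram_resource_share.resolver_rules.arn
}
```

Workload accounts still have to associate the shared rule with their own VPCs (for example with `aws_route53_resolver_rule_association`) before it takes effect there.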
Security & Logging for Hybrid Connectivity
All hybrid traffic adheres to the same security and logging standards as intra-AWS traffic:
| Control | Implementation |
|---|---|
| Encryption in transit | DX: MACsec or IPsec overlay; VPN: IPsec (always encrypted) |
| Network Firewall | All traffic between on-prem and VPCs passes through the Network Firewall inspection VPC |
| TGW Flow Logs | Captures all traffic crossing the Transit Gateway — source/dest, bytes, action |
| VPC Flow Logs | Per-VPC flow logs in every workload account → S3 (Log Archive) |
| CloudTrail | All networking API calls logged (CreateVpnConnection, CreateTransitGatewayPeeringAttachment, etc.) |
| DX/VPN monitoring | CloudWatch alarms on ConnectionState (DX) and TunnelState (VPN) — SNS alert on failover |
| Route 53 query logs | All DNS queries logged — detect unauthorized DNS resolution attempts |
| AWS Config | Tracks changes to TGW route tables, VPN configs, security groups, NACLs |
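
Two of the controls above, per-VPC flow logs and DNS query logging, are small enough to sketch together. The example assumes it runs in a workload account; `var.vpc_id` and `var.log_archive_bucket_arn` are hypothetical inputs, and the Log Archive bucket policy is assumed to already allow log delivery from this account.

```hcl
# Sketch: VPC flow logs delivered to the central Log Archive bucket.
resource "aws_flow_log" "workload_vpc" {
  vpc_id               = var.vpc_id
  traffic_type         = "ALL"
  log_destination_type = "s3"
  log_destination      = var.log_archive_bucket_arn
}

# Sketch: Route 53 Resolver query logging for the same VPC.
resource "aws_route53_resolver_query_log_config" "central" {
  name            = "central-dns-query-logs"
  destination_arn = var.log_archive_bucket_arn
}

resource "aws_route53_resolver_query_log_config_association" "workload_vpc" {
  resolver_query_log_config_id = aws_route53_resolver_query_log_config.central.id
  resource_id                  = var.vpc_id
}
```

Packaging both resources in the VPC module means a VPC cannot be created through the pipeline without its logging.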
Terraform — Direct Connect + VPN + Transit Gateway
```hcl
# modules/networking/transit-gw.tf

# Primary region Transit Gateway
resource "aws_ec2_transit_gateway" "primary" {
  description                     = "Enterprise TGW - Primary Region"
  amazon_side_asn                 = 64512
  auto_accept_shared_attachments  = "disable"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  dns_support                     = "enable"
  transit_gateway_cidr_blocks     = [var.tgw_cidr]

  tags = { Name = "enterprise-tgw-primary" }
}

# Share TGW via RAM
resource "aws_ram_resource_share" "tgw_share" {
  name                      = "tgw-org-share"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "tgw" {
  resource_arn       = aws_ec2_transit_gateway.primary.arn
  resource_share_arn = aws_ram_resource_share.tgw_share.arn
}

resource "aws_ram_principal_association" "org" {
  principal          = aws_organizations_organization.org.arn
  resource_share_arn = aws_ram_resource_share.tgw_share.arn
}

# TGW route tables (segmented)
resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.primary.id
  tags               = { Name = "tgw-rt-prod" }
}

resource "aws_ec2_transit_gateway_route_table" "non_prod" {
  transit_gateway_id = aws_ec2_transit_gateway.primary.id
  tags               = { Name = "tgw-rt-non-prod" }
}

resource "aws_ec2_transit_gateway_route_table" "shared" {
  transit_gateway_id = aws_ec2_transit_gateway.primary.id
  tags               = { Name = "tgw-rt-shared-services" }
}

# TGW Flow Logs. Transit Gateway flow logs are created with the generic
# aws_flow_log resource; traffic_type is omitted because it applies to
# VPC, subnet, and ENI flow logs.
resource "aws_flow_log" "tgw_flow" {
  transit_gateway_id       = aws_ec2_transit_gateway.primary.id
  log_destination          = aws_s3_bucket.log_archive.arn
  log_destination_type     = "s3"
  max_aggregation_interval = 60 # must be 60 seconds for Transit Gateway flow logs

  tags = { Name = "tgw-flow-logs" }
}
```
```hcl
# modules/networking/direct-connect.tf

# Direct Connect Gateway (global resource)
resource "aws_dx_gateway" "main" {
  name            = "enterprise-dx-gateway"
  amazon_side_asn = "64513"
}

# Associate DX Gateway with Primary TGW
resource "aws_dx_gateway_association" "primary" {
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_ec2_transit_gateway.primary.id
  allowed_prefixes      = var.aws_cidr_blocks # CIDRs to advertise to on-prem
}

# Associate DX Gateway with DR TGW (multi-region)
resource "aws_dx_gateway_association" "dr" {
  provider              = aws.dr_region
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_ec2_transit_gateway.dr.id
  allowed_prefixes      = var.aws_cidr_blocks_dr
}
```
```hcl
# modules/networking/vpn-backup.tf

# Customer Gateway (on-prem router)
resource "aws_customer_gateway" "onprem" {
  bgp_asn    = var.onprem_bgp_asn # e.g., 65000
  ip_address = var.onprem_public_ip
  type       = "ipsec.1"

  tags = { Name = "onprem-cgw" }
}

# Site-to-Site VPN attached to Transit Gateway
resource "aws_vpn_connection" "backup" {
  customer_gateway_id = aws_customer_gateway.onprem.id
  transit_gateway_id  = aws_ec2_transit_gateway.primary.id
  type                = "ipsec.1"
  static_routes_only  = false # Use BGP
  enable_acceleration = true  # Global Accelerator for VPN
  tunnel1_inside_cidr = var.tunnel1_cidr
  tunnel2_inside_cidr = var.tunnel2_cidr

  tags = { Name = "onprem-backup-vpn" }
}

# CloudWatch alarm: VPN tunnel down.
# At the VpnId level, TunnelState drops below 1 when at least one tunnel is
# not up. "Minimum" fires on any tunnel loss; switch to "Maximum" to alarm
# only when no tunnel was fully up during the period.
resource "aws_cloudwatch_metric_alarm" "vpn_tunnel_down" {
  alarm_name          = "vpn-tunnel-down"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 2
  metric_name         = "TunnelState"
  namespace           = "AWS/VPN"
  period              = 300
  statistic           = "Minimum"
  threshold           = 1
  alarm_description   = "VPN tunnel is down"
  alarm_actions       = [aws_sns_topic.infra_alerts_critical.arn]

  dimensions = {
    VpnId = aws_vpn_connection.backup.id
  }
}
```
```hcl
# modules/networking/tgw-peering.tf

# Inter-region TGW peering (primary <-> DR)
resource "aws_ec2_transit_gateway_peering_attachment" "primary_to_dr" {
  peer_region             = var.dr_region
  peer_transit_gateway_id = aws_ec2_transit_gateway.dr.id
  transit_gateway_id      = aws_ec2_transit_gateway.primary.id

  tags = { Name = "tgw-peering-primary-to-dr" }
}

# Accept the peering in DR region
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "dr" {
  provider                      = aws.dr_region
  transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.primary_to_dr.id

  tags = { Name = "tgw-peering-dr-accept" }
}

# Static route in the DR region: send traffic destined for on-prem across
# the peering attachment (peering attachments do not propagate routes).
resource "aws_ec2_transit_gateway_route" "dr_to_onprem" {
  provider                       = aws.dr_region
  destination_cidr_block         = var.onprem_cidr
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment.primary_to_dr.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.dr_shared.id
}
```
Business Continuity & Disaster Recovery
DR Strategy Tiers
Not all workloads need the same DR posture. Define tiers based on criticality:
| Tier | Strategy | RPO | RTO | Workload Examples | AWS Implementation |
|---|---|---|---|---|---|
| Tier 1 — Critical | Active-Active (Multi-Region) | Near-zero | < 5 min | Customer-facing APIs, auth services | Aurora Global Database, Route 53 failover, ECS in both regions |
| Tier 2 — Important | Warm Standby | < 15 min | < 30 min | Internal apps, CI/CD | Scaled-down replicas in DR region, AMIs replicated, RDS read replicas |
| Tier 3 — Standard | Pilot Light | < 1 hour | < 4 hours | Batch processing, analytics | Core infra running in DR (networking, DB replicas), compute off |
| Tier 4 — Non-Critical | Backup & Restore | < 24 hours | < 24 hours | Dev/sandbox, archival | S3 cross-region replication, AWS Backup cross-region vaults |
Multi-Region Services
| Service | Multi-Region Capability |
|---|---|
| Aurora | Aurora Global Database — 1 primary region (read/write), up to 5 secondary regions (read-only, < 1 second replication lag); failover promotes secondary to primary |
| DynamoDB | Global Tables — multi-region, multi-active; automatic replication |
| S3 | Cross-Region Replication (CRR) — async replication with optional RTC (Replication Time Control, < 15 min SLA) |
| ECS/EKS | Deploy identical task definitions/deployments in DR region; use Route 53 for traffic steering |
| Lambda | Deploy functions in both regions; no state to replicate |
| Secrets Manager | Multi-region secrets with automatic replication |
| KMS | Multi-region keys — same key material in both regions for seamless encryption/decryption |
| Route 53 | Health checks + failover routing policies — automatic DNS failover |
| AWS Backup | Cross-region backup copies — automated via backup plans |
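
As one concrete instance of the table above, cross-region backup copies can be declared in a backup plan. The following is a sketch, assuming the `aws.dr_region` provider alias used elsewhere in this guide; the vault names, schedule, and 35-day retention are placeholder values, not recommendations from the original text.

```hcl
# Sketch: daily backups in the primary region, copied to a DR-region vault.
resource "aws_backup_vault" "primary" {
  name = "primary-vault"
}

resource "aws_backup_vault" "dr" {
  provider = aws.dr_region
  name     = "dr-vault"
}

resource "aws_backup_plan" "daily_cross_region" {
  name = "daily-cross-region"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 5 * * ? *)" # every day at 05:00 UTC

    lifecycle {
      delete_after = 35 # days; illustrative retention
    }

    # Copy each recovery point to the DR region vault
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn

      lifecycle {
        delete_after = 35
      }
    }
  }
}
```

A plan protects nothing until resources are assigned to it, typically with an `aws_backup_selection` that matches on tags such as the DR tier.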
Route 53 Failover Routing
```hcl
# modules/dr/route53-failover.tf

resource "aws_route53_health_check" "primary_alb" {
  fqdn              = var.primary_alb_dns
  port              = 443
  type              = "HTTPS"
  request_interval  = 10
  failure_threshold = 3

  tags = { Name = "primary-region-health-check" }
}

# Primary record: served while the health check passes
resource "aws_route53_record" "app_primary" {
  zone_id = var.hosted_zone_id
  name    = "app.company.com"
  type    = "A"

  alias {
    name                   = var.primary_alb_dns
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary_alb.id
}

# Secondary record: served when the primary is unhealthy
resource "aws_route53_record" "app_secondary" {
  zone_id = var.hosted_zone_id
  name    = "app.company.com"
  type    = "A"

  alias {
    name                   = var.dr_alb_dns
    zone_id                = var.dr_alb_zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
}
```
Alerting & Change Notifications
AWS recommends a layered alerting architecture using EventBridge as the central event bus, SNS for notification delivery, and CloudWatch Alarms for metric-based thresholds. This provides real-time visibility into changes, security events, cost anomalies, and operational issues.
Alerting Architecture
```text
┌─────────────────────────────────────────────────────────────────┐
│                          Event Sources                          │
│ CloudTrail │ Config │ GuardDuty │ Health │ CloudWatch │ Budgets │
└──────┬─────┴────┬───┴─────┬─────┴────┬───┴──────┬─────┴────┬────┘
       │          │         │          │          │          │
     ┌─▼──────────▼─────────▼──────────▼──────────▼──────────▼─┐
     │            Amazon EventBridge (Default Bus)             │
     │            + Custom Rules per event pattern             │
     └────┬──────────────┬─────────────┬───────────────┬───────┘
          │              │             │               │
   ┌──────▼─────┐  ┌─────▼─────┐  ┌────▼────┐  ┌───────▼───────┐
   │ SNS        │  │ Lambda    │  │ SQS     │  │ AWS Chatbot   │
   │ Topics     │  │ Auto-     │  │ Queue   │  │ (Slack/Teams) │
   │ (email,    │  │ remediate │  │ (batch) │  │               │
   │ PagerDuty) │  │           │  │         │  │               │
   └────────────┘  └───────────┘  └─────────┘  └───────────────┘
```
EventBridge Rules — What to Alert On
| Event Source | Event Pattern | Alert Severity | Notification Target |
|---|---|---|---|
| CloudTrail | Root user login | 🔴 Critical | Security team SNS + PagerDuty |
| CloudTrail | Console login without MFA | 🔴 Critical | Security team SNS |
| CloudTrail | IAM policy changes (PutRolePolicy, AttachRolePolicy) | 🟡 Warning | Security team SNS |
| CloudTrail | Security group changes (AuthorizeSecurityGroupIngress) | 🟡 Warning | Infra team SNS + Slack |
| CloudTrail | S3 bucket policy changes | 🟡 Warning | Security team SNS |
| CloudTrail | KMS key deletion scheduled | 🔴 Critical | Security team SNS + PagerDuty |
| GuardDuty | HIGH or CRITICAL severity finding | 🔴 Critical | Security team SNS + PagerDuty + Lambda (auto-isolate) |
| AWS Config | Non-compliant resource (missing tags) | 🟡 Warning | Tag violations SNS → team lead |
| AWS Config | Security group open to 0.0.0.0/0 | 🔴 Critical | Security team SNS + Lambda (auto-remediate) |
| AWS Health | Scheduled maintenance or service event | 🟡 Warning | Infra team SNS + Slack |
| Budgets | 80% / 100% threshold breach | 🟡 Warning / 🔴 Critical | Cost alerts SNS → finance + account owner |
| Cost Anomaly Detection | Anomaly detected | 🟡 Warning | Cost alerts SNS → finance |
| CloudWatch Alarm | EC2 StatusCheckFailed | 🔴 Critical | Infra critical SNS + PagerDuty |
| CloudWatch Alarm | RDS CPU > 90% for 10 min | 🟡 Warning | Infra warning SNS + Slack |
| DX/VPN | Connection state change (DX down, VPN tunnel down) | 🔴 Critical | Infra critical SNS + PagerDuty |
| Route 53 | Health check failure (DR failover triggered) | 🔴 Critical | Infra critical SNS + PagerDuty |
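
Two rows of this table, KMS key deletion and console login without MFA, can be written as EventBridge rules in the same style as the rest of the alerting module. This is a sketch: the rule names are made up for illustration, the patterns follow the CloudTrail event fields as commonly documented, and the targets assume the `security_alerts` SNS topic from this guide.

```hcl
# Sketch: alert when a KMS key is scheduled for deletion.
resource "aws_cloudwatch_event_rule" "kms_key_deletion" {
  name        = "detect-kms-key-deletion"
  description = "Alert when a KMS key is scheduled for deletion"

  event_pattern = jsonencode({
    source      = ["aws.kms"]
    detail-type = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["kms.amazonaws.com"]
      eventName   = ["ScheduleKeyDeletion"]
    }
  })
}

resource "aws_cloudwatch_event_target" "kms_key_deletion_sns" {
  rule = aws_cloudwatch_event_rule.kms_key_deletion.name
  arn  = aws_sns_topic.security_alerts.arn
}

# Sketch: alert on successful console sign-ins that did not use MFA.
resource "aws_cloudwatch_event_rule" "console_login_no_mfa" {
  name        = "detect-console-login-without-mfa"
  description = "Alert on successful console logins without MFA"

  event_pattern = jsonencode({
    source      = ["aws.signin"]
    detail-type = ["AWS Console Sign In via CloudTrail"]
    detail = {
      eventName           = ["ConsoleLogin"]
      additionalEventData = { MFAUsed = ["No"] }
      responseElements    = { ConsoleLogin = ["Success"] }
    }
  })
}

resource "aws_cloudwatch_event_target" "console_login_no_mfa_sns" {
  rule = aws_cloudwatch_event_rule.console_login_no_mfa.name
  arn  = aws_sns_topic.security_alerts.arn
}
```

Federated sign-ins usually report `MFAUsed` as `No` because MFA is enforced at the identity provider, so the second rule may need an extra filter on `userIdentity.type` to avoid noise.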
AWS Chatbot — Slack/Teams Integration (Recommended)
AWS Chatbot (since renamed Amazon Q Developer in chat applications) is well suited to team-level notifications. It integrates directly with Slack and Microsoft Teams and renders CloudWatch alarms, Security Hub findings, and EventBridge events as interactive cards.
| Configuration | Setting |
|---|---|
| Slack channel: #infra-alerts | CloudWatch alarms (warning + critical), AWS Health events |
| Slack channel: #security-alerts | GuardDuty findings, Config non-compliance, IAM changes |
| Slack channel: #cost-alerts | Budget breaches, cost anomalies |
| Slack channel: #deploy-notifications | CodePipeline state changes (started, succeeded, failed) |
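
One row of the channel mapping above might look like this in Terraform. It is a sketch that assumes a recent AWS provider release that includes the `aws_chatbot_slack_channel_configuration` resource, and a Slack workspace that has already been authorized in the console. The three `var.*` inputs are hypothetical; check the attribute names against the provider documentation for your version.

```hcl
# Sketch: route infrastructure alerts to the #infra-alerts Slack channel.
# var.chatbot_role_arn, var.slack_team_id, and var.infra_alerts_channel_id
# are hypothetical inputs.
resource "aws_chatbot_slack_channel_configuration" "infra_alerts" {
  configuration_name = "infra-alerts"
  iam_role_arn       = var.chatbot_role_arn
  slack_team_id      = var.slack_team_id
  slack_channel_id   = var.infra_alerts_channel_id

  # Topics from the alerting module whose messages appear in the channel
  sns_topic_arns = [
    aws_sns_topic.infra_alerts_critical.arn,
    aws_sns_topic.infra_alerts_warning.arn,
  ]
}
```

The remaining channels follow the same pattern with their own channel IDs and topic lists.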
Terraform — EventBridge Rules & SNS
```hcl
# modules/alerting/eventbridge-rules.tf

# Rule: Root user login
resource "aws_cloudwatch_event_rule" "root_login" {
  name        = "detect-root-login"
  description = "Alert on any root user console login"

  event_pattern = jsonencode({
    source      = ["aws.signin"]
    detail-type = ["AWS Console Sign In via CloudTrail"]
    detail = {
      userIdentity = { type = ["Root"] }
    }
  })
}

resource "aws_cloudwatch_event_target" "root_login_sns" {
  rule = aws_cloudwatch_event_rule.root_login.name
  arn  = aws_sns_topic.security_alerts.arn
}

# Rule: Security group opened to the world.
# The pattern matches every AuthorizeSecurityGroupIngress call; the Lambda
# target is expected to check for 0.0.0.0/0 before remediating.
resource "aws_cloudwatch_event_rule" "sg_open_to_world" {
  name = "detect-sg-open-to-world"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["AWS API Call via CloudTrail"]
    detail = {
      eventName = ["AuthorizeSecurityGroupIngress"]
    }
  })
}

resource "aws_cloudwatch_event_target" "sg_lambda" {
  rule = aws_cloudwatch_event_rule.sg_open_to_world.name
  arn  = aws_lambda_function.sg_auto_remediate.arn
}

# EventBridge needs explicit permission to invoke the Lambda target
resource "aws_lambda_permission" "allow_eventbridge_sg" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.sg_auto_remediate.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.sg_open_to_world.arn
}

# Rule: GuardDuty HIGH/CRITICAL findings
resource "aws_cloudwatch_event_rule" "guardduty_high" {
  name = "guardduty-high-severity"

  event_pattern = jsonencode({
    source      = ["aws.guardduty"]
    detail-type = ["GuardDuty Finding"]
    detail = {
      severity = [{ numeric = [">=", 7] }]
    }
  })
}

resource "aws_cloudwatch_event_target" "guardduty_sns" {
  rule = aws_cloudwatch_event_rule.guardduty_high.name
  arn  = aws_sns_topic.security_alerts.arn
}

# Rule: CodePipeline state changes (deploy notifications)
resource "aws_cloudwatch_event_rule" "pipeline_state" {
  name = "codepipeline-state-change"

  event_pattern = jsonencode({
    source      = ["aws.codepipeline"]
    detail-type = ["CodePipeline Pipeline Execution State Change"]
    detail = {
      state = ["SUCCEEDED", "FAILED", "CANCELED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "pipeline_sns" {
  rule = aws_cloudwatch_event_rule.pipeline_state.name
  arn  = aws_sns_topic.deploy_notifications.arn
}

# Rule: DX connection state change.
# Confirm this source and detail-type against the current Direct Connect
# documentation before relying on it; a CloudWatch alarm on the
# ConnectionState metric is an alternative signal.
resource "aws_cloudwatch_event_rule" "dx_state_change" {
  name = "direct-connect-state-change"

  event_pattern = jsonencode({
    source      = ["aws.directconnect"]
    detail-type = ["Direct Connect Connection State Change"]
  })
}

resource "aws_cloudwatch_event_target" "dx_state_sns" {
  rule = aws_cloudwatch_event_rule.dx_state_change.name
  arn  = aws_sns_topic.infra_alerts_critical.arn
}
```
```hcl
# modules/alerting/sns-topics.tf

resource "aws_sns_topic" "security_alerts" {
  name              = "security-alerts"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "security-alerts" }
}

resource "aws_sns_topic" "infra_alerts_critical" {
  name              = "infra-alerts-critical"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "infra-alerts-critical" }
}

resource "aws_sns_topic" "infra_alerts_warning" {
  name              = "infra-alerts-warning"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "infra-alerts-warning" }
}

resource "aws_sns_topic" "cost_alerts" {
  name              = "cost-alerts"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "cost-alerts" }
}

resource "aws_sns_topic" "deploy_notifications" {
  name              = "deploy-notifications"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "deploy-notifications" }
}

# SNS topic policy: allow EventBridge to publish.
# for_each uses a map with static keys because topic ARNs are unknown until
# apply and cannot serve as for_each keys. The KMS key policy must also let
# events.amazonaws.com use the key for encrypted topics.
resource "aws_sns_topic_policy" "allow_eventbridge" {
  for_each = {
    security = aws_sns_topic.security_alerts.arn
    infra    = aws_sns_topic.infra_alerts_critical.arn
    cost     = aws_sns_topic.cost_alerts.arn
    deploy   = aws_sns_topic.deploy_notifications.arn
  }

  arn = each.value

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "AllowEventBridge"
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sns:Publish"
      Resource  = each.value
    }]
  })
}
```
Monitoring & Alerting Built Into Deployment
Every Terraform module in this landing zone includes monitoring resources alongside the infrastructure they monitor. Monitoring is not a follow-up task — it deploys with the resource.
Monitoring-as-Code: What Gets Created With Every Deployment
| Resource Deployed | Monitoring Created Alongside |
|---|---|
| EC2 Instance | CloudWatch alarm: CPU > 85% for 5 min; StatusCheckFailed alarm; disk/memory via CloudWatch Agent |
| RDS Instance | CloudWatch alarms: CPUUtilization, FreeableMemory, DatabaseConnections, ReadLatency, ReplicaLag |
| ALB | CloudWatch alarms: TargetResponseTime > 1s, UnHealthyHostCount > 0, HTTP 5xx rate > 1% |
| ECS Service | Container Insights enabled; alarm on RunningTaskCount < DesiredTaskCount |
| Lambda Function | CloudWatch alarms: Errors > 0, Duration > 80% of timeout, Throttles > 0 |
| S3 Bucket | CloudWatch alarm: 4xxErrors rate; S3 Storage Lens enabled |
| VPC | Flow Logs enabled to S3 + CloudWatch; DNS query logging enabled |
| Any resource | AWS Config recorder running; Config rule: required-tags |
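
To make the "monitoring deploys with the resource" idea concrete, the Lambda row of the table could be packaged as a small reusable module. The sketch below is illustrative only: the three variables are hypothetical module inputs, and the thresholds mirror the table (errors above 0, throttles above 0, duration above 80% of the timeout).

```hcl
# Sketch: alarms created alongside a Lambda function.
variable "function_name" {
  type = string
}

variable "timeout_seconds" {
  type = number
}

variable "alarm_topic_arn" {
  type = string
}

locals {
  # One entry per alarm; Duration is reported in milliseconds
  lambda_alarms = {
    errors    = { metric = "Errors", statistic = "Sum", threshold = 0 }
    throttles = { metric = "Throttles", statistic = "Sum", threshold = 0 }
    duration  = { metric = "Duration", statistic = "Maximum", threshold = var.timeout_seconds * 1000 * 0.8 }
  }
}

resource "aws_cloudwatch_metric_alarm" "lambda" {
  for_each = local.lambda_alarms

  alarm_name          = "${var.function_name}-${each.key}"
  namespace           = "AWS/Lambda"
  metric_name         = each.value.metric
  statistic           = each.value.statistic
  threshold           = each.value.threshold
  comparison_operator = "GreaterThanThreshold"
  period              = 300
  evaluation_periods  = 1
  treat_missing_data  = "notBreaching" # idle functions emit no data points
  alarm_actions       = [var.alarm_topic_arn]

  dimensions = {
    FunctionName = var.function_name
  }
}
```

Calling this module from the same place that declares the function keeps the alarm set in step with the function's lifecycle.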
CloudWatch Dashboard — Deployed by Terraform
```hcl
# modules/monitoring/dashboards.tf

resource "aws_cloudwatch_dashboard" "operational" {
  dashboard_name = "enterprise-operations-${var.environment}"

  # Metric widgets need a region; var.aws_region is an assumed input here
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          title   = "EC2 CPU Utilization"
          region  = var.aws_region
          metrics = [["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", var.asg_name]]
          period  = 300
          stat    = "Average"
        }
      },
      {
        type = "metric"
        properties = {
          title   = "RDS Connections"
          region  = var.aws_region
          metrics = [["AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", var.db_instance]]
          period  = 300
          stat    = "Average"
        }
      },
      {
        type = "metric"
        properties = {
          title   = "ALB Response Time"
          region  = var.aws_region
          metrics = [["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", var.alb_arn_suffix]]
          period  = 60
          stat    = "p99"
        }
      }
    ]
  })
}
```
SNS Topics for Alert Routing
| Topic | Subscribers | Triggers |
|---|---|---|
| infra-alerts-critical | PagerDuty / OpsGenie integration | EC2 status check failed, RDS failover, ECS task crash loop |
| infra-alerts-warning | Team Slack channel + email | CPU > 85%, memory > 90%, disk > 80% |
| security-alerts | Security team email + SIEM | GuardDuty HIGH/CRITICAL, Security Hub CRITICAL |
| cost-alerts | Finance + account owner email | Budget threshold breach, anomaly detection |
| tag-violations | Team lead email | Config rule: non-compliant (missing tags) |
Additional Best Practices & Considerations
Security Hardening
- [ ] Enable MFA on all human IAM Identity Center accounts — enforce in permission set policies
- [ ] Rotate credentials — no long-lived access keys; use IAM roles and short-lived STS tokens
- [ ] Enable AWS Private CA if you need internal TLS certificates at scale
- [ ] VPC endpoints for S3, DynamoDB, CloudWatch, KMS, STS — avoid sending traffic over the internet
- [ ] IMDSv2 required — SCP blocks EC2 launches without `HttpTokens = required`
- [ ] EBS default encryption — enable account-level default EBS encryption with KMS
- [ ] S3 Block Public Access — enabled at the organization level
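
Several items in this checklist map to one or two Terraform resources each. The following sketch covers three of them under stated assumptions: the EBS and S3 resources are per-account, per-region settings that would be applied in every account baseline, and the SCP still has to be attached to the relevant OUs (for example with `aws_organizations_policy_attachment`). Resource and policy names are illustrative.

```hcl
# Sketch: account-level default EBS encryption (applies per region).
resource "aws_ebs_encryption_by_default" "this" {
  enabled = true
}

# Sketch: account-level S3 Block Public Access.
resource "aws_s3_account_public_access_block" "this" {
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Sketch: SCP that denies EC2 launches unless IMDSv2 is required.
resource "aws_organizations_policy" "require_imdsv2" {
  name = "require-imdsv2"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "RequireImdsV2"
      Effect   = "Deny"
      Action   = "ec2:RunInstances"
      Resource = "arn:aws:ec2:*:*:instance/*"
      Condition = {
        StringNotEquals = {
          "ec2:MetadataHttpTokens" = "required"
        }
      }
    }]
  })
}
```

Testing the SCP against a sandbox OU first helps catch launch templates and Auto Scaling groups that still default to IMDSv1.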
Networking
- [ ] Use AWS VPC IPAM for centralized CIDR management — prevents overlaps
- [ ] DNS resolution — Route 53 Private Hosted Zones associated with workload VPCs (cross-account association or Route 53 Profiles); Resolver rules for on-prem shared via RAM
- [ ] Egress inspection — AWS Network Firewall in the network hub for all outbound traffic
- [ ] No public subnets in workload accounts (except for ALBs) — use NAT Gateways or centralized egress
Operational Readiness
- [ ] AWS Trusted Advisor — enable organizational view; remediate HIGH findings
- [ ] AWS Health — enable organizational Health events; EventBridge rules for automated response
- [ ] Patch management — Systems Manager Patch Manager with maintenance windows
- [ ] Golden AMI pipeline — EC2 Image Builder → test → approve → share via AWS RAM
- [ ] Backup strategy — AWS Backup with organization-wide backup policies; periodic restore tests
- [ ] Disaster recovery — define RPO/RTO per workload tier; implement pilot light or warm standby for critical workloads
Compliance & Audit
- [ ] AWS Audit Manager — continuous evidence collection for SOC 2, HIPAA, PCI
- [ ] AWS Artifact — download AWS compliance reports (SOC, ISO, PCI)
- [ ] Well-Architected Tool — schedule quarterly Well-Architected Reviews per workload
- [ ] Config Conformance Packs — deploy pre-built rule sets for specific compliance frameworks
Developer Experience
- [ ] Service Catalog — pre-approved resource templates so developers don't need to know Terraform
- [ ] Sandbox accounts — low-friction experimentation with budget caps and auto-cleanup (`aws-nuke` on a schedule)
- [ ] Amazon Q Developer — AI-assisted coding and cloud operations
- [ ] Self-service account vending — AFT or CfCT for teams to request new accounts via PR
Cost Optimization
- [ ] Savings Plans — Compute Savings Plans for predictable EC2/Fargate/Lambda usage
- [ ] Reserved Instances — for stable RDS and ElastiCache workloads
- [ ] Spot Instances — for batch processing, CI/CD build agents, and fault-tolerant workloads
- [ ] S3 Intelligent-Tiering — for buckets with unpredictable access patterns
- [ ] Right-sizing — AWS Compute Optimizer recommendations reviewed monthly
- [ ] Unused resource cleanup — Lambda function scans for unattached EBS volumes, idle EC2, unused EIPs
Phased Rollout Plan
Phase 1: Foundation (Week 1–2)
| Task | Services |
|---|---|
| Enable AWS Organizations + Control Tower | Organizations, Control Tower, IAM Identity Center |
| Create core OUs and accounts (Security, Infrastructure) | Log Archive, Security Tooling, Audit, Network Hub |
| Set up IAM Identity Center with corporate IdP | IAM Identity Center, SAML federation |
| Apply baseline SCPs (deny root, restrict regions, IMDSv2) | Organizations SCPs |
| Enable org-wide CloudTrail to Log Archive | CloudTrail, S3, KMS |
| Enable AWS Config with aggregator | AWS Config |
| Activate cost allocation tags | Billing, Tag Policies |
Phase 2: Networking & Security (Week 2–3)
| Task | Services |
|---|---|
| Deploy Transit Gateway in Network Hub | Transit Gateway, RAM |
| Configure VPC IPAM for CIDR management | VPC IPAM |
| Deploy Network Firewall for egress inspection | Network Firewall |
| Set up Route 53 Private Hosted Zones + Resolver | Route 53 |
| Provision Direct Connect (primary) + Site-to-Site VPN (backup) | Direct Connect, VPN |
| Configure Transit Gateway route tables (prod, non-prod, shared) | Transit Gateway |
| Enable GuardDuty (delegated admin in Security Tooling) | GuardDuty |
| Enable Security Hub with aggregation | Security Hub |
| Deploy Config Rules + tag compliance rules | AWS Config |
| Set up CUR + Budgets + Anomaly Detection | Billing, CUR, Budgets |
Phase 3: DevTools & CI/CD (Week 3–4)
| Task | Services |
|---|---|
| Provision CI/CD account in DevTools OU | Control Tower Account Factory |
| Build application CI/CD pipeline | CodePipeline, CodeBuild, CodeDeploy |
| Build infrastructure CI/CD pipeline | CodePipeline, CodeBuild, Terraform |
| Set up ECR and CodeArtifact | ECR, CodeArtifact |
| Create cross-account deploy roles in workload accounts | IAM |
| Deploy tag validation step in pipelines | CodeBuild |
| Deploy monitoring-as-code modules | CloudWatch, SNS |
| Configure EventBridge alerting rules (root login, SG changes, GuardDuty) | EventBridge, SNS |
| Set up AWS Chatbot for Slack/Teams notifications | AWS Chatbot |
Phase 4: Workloads (Week 4–5)
| Task | Services |
|---|---|
| Provision workload accounts (Dev, Staging, Prod) | AFT or CfCT |
| Deploy VPCs via IaC into each workload account | VPC, Subnets, TGW attachments |
| Provision developer sandbox accounts | Sandbox OU with SCPs + budget caps |
| Run first Well-Architected Review | Well-Architected Tool |
Phase 5: Multi-Region & DR (Week 5–6)
| Task | Services |
|---|---|
| Deploy Transit Gateway in DR region + inter-region peering | Transit Gateway |
| Configure Direct Connect Gateway association with DR TGW | Direct Connect |
| Deploy Aurora Global Database for Tier 1 workloads | Aurora |
| Set up S3 cross-region replication for critical buckets | S3 CRR |
| Configure Route 53 failover routing + health checks | Route 53 |
| Set up AWS Backup cross-region vault copies | AWS Backup |
| Conduct DR failover test (GameDay) | Fault Injection Service |
Phase 6: Optimization & Hardening (Week 6–7)
| Task | Services |
|---|---|
| Enable Compute Optimizer and right-sizing recommendations | Compute Optimizer |
| Purchase Savings Plans / Reserved Instances | Cost Management |
| Deploy Audit Manager frameworks | Audit Manager |
| Set up automated backup with restore testing | AWS Backup |
| GameDay / chaos engineering exercise | Fault Injection Service |
| QuickSight chargeback dashboards | QuickSight, Athena, CUR |
Conclusion
Building an enterprise AWS landing zone is not a weekend project — it's a deliberate, multi-phase effort that establishes the foundation for everything that follows. The key principles:
- Multi-account isolation — separate workloads, environments, and functions into distinct accounts
- Governance as code — SCPs, tag policies, and Config rules managed in Terraform, not the console
- Logging everything — CloudTrail, Config, VPC Flow Logs, DNS, and access logs flowing to an immutable Log Archive
- Tag everything — mandatory tags enforced at four layers: IaC defaults, CI/CD validation, SCPs, and Config rules
- Monitor at deploy time — CloudWatch alarms, dashboards, and SNS topics created alongside every resource
- CI/CD for everything — applications and infrastructure deploy through pipelines with approval gates and audit trails
- Multi-region resilience — DR region with Transit Gateway peering, Aurora Global, Route 53 failover, and tiered RPO/RTO
- Hybrid connectivity — Direct Connect (primary) + VPN (backup) reaching both regions via a single DX Gateway
- Proactive alerting — EventBridge + SNS + AWS Chatbot for real-time notification of security events, infrastructure changes, and cost anomalies
- Well-Architected by design — every component maps to one or more of the six pillars
Start with Control Tower + Terraform AFT. Build the foundation right, and every team — infrastructure, developers, security — has a governed, observable, cost-transparent environment to operate in.