AWS Enterprise Landing Zone

48 minute read
Content level: Advanced
Building an Enterprise Landing Zone

Building a Secure, Scalable, and Well-Architected AWS Enterprise Landing Zone

A comprehensive guide to multi-account strategy, centralized logging, cost accountability, CI/CD, multi-region hybrid connectivity, and Infrastructure as Code — aligned to the AWS Well-Architected Framework.

Disclaimer: This guide presents a reference architecture based on AWS best practices and the Well-Architected Framework. Your actual implementation will vary depending on your organization's compliance requirements (e.g., HIPAA, PCI-DSS, FedRAMP, GDPR, SOC 2), industry regulations, existing infrastructure, risk tolerance, team size, and budget. Treat this as a starting point — not a prescriptive blueprint. Always validate architecture decisions with your security, compliance, and legal teams before deploying to production. AWS services, pricing, and features evolve frequently; verify current capabilities in the AWS documentation at the time of implementation.


Table of Contents

  1. Introduction
  2. Before You Begin — Preparation Checklist
  3. Architecture Overview
  4. Multi-Account Strategy & OU Structure
  5. AWS Well-Architected Framework Alignment
  6. Centralized Logging & Access Tracking
  7. Mandatory Tagging & Deployment Accountability
  8. Cost Allocation, Monitoring & Chargeback
  9. Services by Team
  10. CI/CD Pipeline Architecture
  11. Infrastructure as Code — Terraform Starter
  12. Multi-Region Architecture & Hybrid Connectivity
  13. Business Continuity & Disaster Recovery
  14. Alerting & Change Notifications
  15. Monitoring & Alerting Built Into Deployment
  16. Additional Best Practices & Considerations
  17. Phased Rollout Plan
  18. Conclusion

Introduction

Every enterprise AWS journey starts with the same question: How do we build a foundation that's secure, scalable, cost-transparent, and doesn't become unmanageable at scale?

The answer is a landing zone — a well-architected, multi-account AWS environment with built-in governance, security controls, centralized logging, cost allocation, and automated deployment pipelines. This guide covers everything you need to go from zero to a production-ready enterprise AWS environment, mapped to the six pillars of the AWS Well-Architected Framework.

We'll cover:

  • Multi-account strategy with AWS Organizations and Control Tower
  • Centralized logging for every API call, network flow, and resource change
  • Mandatory tagging to track who deployed what, for which department, at what cost
  • CI/CD pipelines for application and infrastructure deployments
  • Infrastructure as Code with Terraform (and CloudFormation alternatives)
  • Multi-region architecture with hybrid connectivity to on-premises via Direct Connect and VPN
  • Alerting and change notifications via EventBridge, SNS, and AWS Chatbot
  • Monitoring and alerting baked into every deployment — not bolted on afterward

Before You Begin — Preparation Checklist

Before deploying a single resource, get these decisions and prerequisites in place.

Organizational Decisions

Decision | What to Define | Why It Matters
Account email strategy | Dedicated email distribution list per account (e.g., aws-security@company.com) | AWS requires a unique email per account; DLs ensure team access, not individual dependency
Naming conventions | Standard for accounts, OUs, resources, tags | Consistency prevents confusion at scale
Region strategy | Primary region + DR region + denied regions | Compliance, latency, and cost implications
IP address plan (CIDR) | Non-overlapping CIDR ranges across all VPCs | You will regret overlapping CIDRs; plan for 3–5 years of growth
Identity provider (IdP) | Okta, Azure AD, Ping, or AWS-native | Federated SSO is non-negotiable for enterprise
Compliance requirements | SOC 2, HIPAA, PCI-DSS, FedRAMP, GDPR | Determines log retention, encryption, and network controls
Cost center taxonomy | Department → Cost Center → Project mapping | Required for chargeback/showback reporting
Change management process | Who approves prod deployments? What's the rollback process? | Must be defined before CI/CD pipelines are built

Technical Prerequisites

  • [ ] Management account created with MFA on root, no workloads deployed
  • [ ] AWS Organizations enabled with all features
  • [ ] Two dedicated email addresses for Log Archive and Audit accounts (Control Tower requirement)
  • [ ] Identity Provider configured and ready for SAML/OIDC federation
  • [ ] IP address plan documented — recommended: use AWS VPC IPAM for automated allocation
  • [ ] Terraform state backend — S3 bucket + DynamoDB table in a dedicated account
  • [ ] Git repository initialized for IaC code (GitHub, GitLab, or CodeCommit)
  • [ ] Cost allocation tags decided and documented (see Mandatory Tagging section)
  • [ ] Incident response plan — at minimum, define escalation paths and communication channels
  • [ ] AWS Support plan — Business or Enterprise Support for production workloads (full Trusted Advisor checks and 24/7 support; a designated TAM comes only with Enterprise-level plans)

Common Mistakes to Avoid

⚠️ Don't deploy workloads in the management account. It should only run Organizations, Control Tower, and billing. No EC2, no Lambda, no applications.

⚠️ Don't skip the IP address plan. Overlapping CIDRs between VPCs are extremely painful to fix after workloads are running.

⚠️ Don't use IAM users for human access. Use IAM Identity Center (SSO) with your corporate IdP from day one. IAM users are for service accounts only — and even those should use IAM roles where possible.

⚠️ Don't leave CloudTrail as a per-account afterthought. Set up the org-wide trail in the management account first, logging to the Log Archive account.
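
The IP address plan is the one item on this list that can be checked mechanically before any VPC exists. The sketch below uses only Python's standard `ipaddress` module to flag overlapping ranges and to carve equal-sized blocks out of a supernet. The VPC names, the ranges, and the helper names `find_overlaps` and `carve_blocks` are illustrative assumptions, not part of this guide's reference architecture; VPC IPAM remains the recommended source of truth.

```python
import ipaddress
from itertools import combinations, islice


def find_overlaps(plan):
    """Return (name, name) pairs whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


def carve_blocks(supernet, new_prefix, count):
    """Take the first `count` equal-sized, non-overlapping blocks of a supernet."""
    blocks = ipaddress.ip_network(supernet).subnets(new_prefix=new_prefix)
    return [str(block) for block in islice(blocks, count)]


# Hypothetical plan: names and ranges are illustrative only.
plan = {
    "network-hub": "10.0.0.0/16",
    "app-a-dev": "10.1.0.0/16",
    "app-a-prod": "10.1.128.0/17",  # sits inside app-a-dev's range
}

overlaps = find_overlaps(plan)
vpc_blocks = carve_blocks("10.16.0.0/12", new_prefix=16, count=3)
print(overlaps)
print(vpc_blocks)
```

Running a check like this in the IaC pipeline, against the documented plan, is one way to catch a collision before it reaches a route table.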


Architecture Overview

The architecture follows a hub-and-spoke model with centralized security, networking, and logging.

                            ┌─────────────────────┐
                            │   IAM Identity Center│
                            │   (Corporate IdP)    │
                            └──────────┬──────────┘
                                       │
                            ┌──────────▼──────────┐
                            │  Management Account  │
                            │  (Organizations,     │
                            │   Control Tower,     │
                            │   Billing)           │
                            └──────────┬──────────┘
               ┌───────────────┬───────┴──────┬────────────────┐
               ▼               ▼              ▼                ▼
    ┌──────────────┐  ┌──────────────┐ ┌────────────┐  ┌────────────┐
    │ Security OU  │  │  Infra OU    │ │Workloads OU│  │ DevTools OU│
    │              │  │              │ │            │  │            │
    │• Log Archive │  │• Network Hub │ │• Dev OU    │  │• CI/CD     │
    │• Security    │  │• Shared Svcs │ │• Staging OU│  │  Account   │
    │  Tooling     │  │• Backup      │ │• Prod OU   │  │            │
    │• Audit       │  │              │ │            │  │            │
    └──────────────┘  └──────┬───────┘ └─────┬──────┘  └─────┬──────┘
                             │               │               │
                      ┌──────▼───────────────▼───────────────▼────────────────┐
                      │              Transit Gateway (Hub)                      │
                      │         Network Hub Account — Infra OU                 │
                      └──────────┬──────────────────┬─────────────────────────┘
                                 │                  │
                      ┌──────────▼──────┐  ┌───────▼────────┐
                      │  Direct Connect │  │  AWS Network   │
                      │  / Site-to-Site │  │  Firewall      │
                      │  VPN            │  │  (Egress/E-W)  │
                      └─────────────────┘  └────────────────┘

Key design principles:

  • Blast radius isolation — each workload, environment, and function lives in its own AWS account
  • Centralized governance — SCPs, tag policies, and Config rules enforced at the organization level
  • Shared networking — Transit Gateway provides connectivity without VPC peering sprawl
  • Immutable logging — all logs flow to a dedicated Log Archive account with S3 Object Lock
  • CI/CD as a first-class citizen — dedicated DevTools account with cross-account deployment roles
  • Multi-region resilience — Transit Gateway inter-region peering with DR region for business continuity
  • Hybrid connectivity — Direct Connect (primary) + Site-to-Site VPN (backup) via Direct Connect Gateway reaching both regions

Multi-Account Strategy & OU Structure

Organizational Units (OUs)

Root
├── Security OU
│   ├── Log Archive          — CloudTrail, Config, VPC Flow Logs (immutable, S3 Object Lock)
│   ├── Security Tooling     — GuardDuty delegated admin, Security Hub, Inspector, Macie
│   └── Audit                — Read-only cross-account access for auditors & compliance
│
├── Infrastructure OU
│   ├── Network Hub          — Transit Gateway, Direct Connect, VPN, DNS, Network Firewall
│   ├── Shared Services      — Managed AD, internal tools, golden AMI pipeline, IPAM
│   └── Backup               — AWS Backup central vault, cross-account backup policies
│
├── Sandbox OU
│   └── Sandbox-{user}       — Experimentation (aggressive SCPs, budget caps, auto-nuke)
│
├── Workloads OU
│   ├── Dev OU
│   │   ├── App-A-Dev
│   │   └── App-B-Dev
│   ├── Staging OU
│   │   ├── App-A-Staging
│   │   └── App-B-Staging
│   └── Prod OU
│       ├── App-A-Prod
│       └── App-B-Prod
│
├── DevTools OU
│   └── CI/CD                — CodePipeline, CodeBuild, ECR, CodeArtifact
│
└── Suspended OU             — Decommissioned accounts (deny-all SCP attached)

Key Service Control Policies (SCPs)

SCP | Attached To | What It Does
Deny root user actions | Root | Blocks all actions by the root user in every member account (SCPs do not apply to the management account)
Restrict regions | Root | Denies API calls outside approved regions (e.g., us-east-1, us-west-2)
Require IMDSv2 | Root | Blocks EC2 launches that don't enforce Instance Metadata Service v2
Deny leaving organization | Root | Prevents any account from removing itself from the org
Deny S3 public access | Root | Blocks changes that would disable S3 Block Public Access (s3:PutAccountPublicAccessBlock, s3:PutBucketPublicAccessBlock)
Deny untagged resources | Workloads OU, DevTools OU | Blocks resource creation without required tags (for actions that support the aws:RequestTag condition key)
Deny expensive services | Sandbox OU | Blocks Redshift, EMR, SageMaker large instances, etc.
Deny VPC peering | Sandbox OU | Prevents sandbox accounts from connecting to other networks
Deny all | Suspended OU | Complete lockout — only billing access remains
Protect log archive | Security OU | Deny s3:DeleteObject, s3:PutBucketPolicy on log buckets
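
To make the "Restrict regions" row concrete, here is a minimal sketch that builds such a policy document as a Python dictionary and serializes it to JSON. It follows the common pattern of a Deny statement on the `aws:RequestedRegion` condition key with a `NotAction` list for global services. The exemption list and the function name `region_restriction_scp` are assumptions for illustration; a real exemption list is usually longer and should be validated against the services you actually use.

```python
import json


def region_restriction_scp(approved_regions, exempt_actions):
    """Build an SCP that denies requests outside the approved regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegions",
                "Effect": "Deny",
                # Global services are exempted so they keep working
                # regardless of the region the request is signed for.
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(approved_regions)
                    }
                },
            }
        ],
    }


# Illustrative, deliberately incomplete list of global-service exemptions.
GLOBAL_SERVICE_ACTIONS = [
    "iam:*", "organizations:*", "route53:*",
    "cloudfront:*", "sts:*", "support:*",
]

policy = region_restriction_scp(["us-east-1", "us-west-2"], GLOBAL_SERVICE_ACTIONS)
policy_json = json.dumps(policy, indent=2)

# SCP documents are limited to 5,120 characters, so check the size early.
assert len(policy_json) < 5120
print(policy_json)
```

The resulting JSON could be passed to the `content` argument of an `aws_organizations_policy` resource or pasted into the Organizations console.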

AWS Well-Architected Framework Alignment

Every component of this landing zone maps to one or more of the six pillars of the AWS Well-Architected Framework.

Pillar 1: Operational Excellence

The ability to support development and run workloads effectively, gain insight into operations, and continuously improve processes and procedures.

Best Practice | Implementation
Perform operations as code | All infrastructure managed via Terraform/CloudFormation; no manual console changes
Make frequent, small, reversible changes | CI/CD pipelines with blue/green and canary deployments
Refine operations procedures frequently | Runbooks in Systems Manager; post-incident reviews
Anticipate failure | GameDays, chaos engineering with AWS Fault Injection Service
Learn from all operational events | CloudTrail + CloudWatch Logs Insights for incident analysis

Services: AWS Systems Manager, CloudFormation/Terraform, CloudWatch, AWS Health, Trusted Advisor

Pillar 2: Security

The ability to protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies.

Best Practice | Implementation
Implement a strong identity foundation | IAM Identity Center with corporate IdP; least-privilege permission sets; no IAM users for humans
Enable traceability | Org-wide CloudTrail; VPC Flow Logs; DNS query logging; S3 access logs
Apply security at all layers | SCPs at org level; security groups at instance level; WAF at edge; Network Firewall at VPC level
Automate security best practices | Config Rules with auto-remediation; GuardDuty auto-response via EventBridge + Lambda
Protect data in transit and at rest | KMS encryption for EBS, S3, RDS; ACM for TLS; VPN/Direct Connect for hybrid
Prepare for security events | Security Hub aggregation; Detective for investigation; incident response runbooks in SSM

Services: IAM Identity Center, GuardDuty, Security Hub, Inspector, Macie, KMS, WAF, Shield, Network Firewall, CloudTrail, AWS Config

Pillar 3: Reliability

The ability of a workload to perform its intended function correctly and consistently when it's expected to.

Best Practice | Implementation
Automatically recover from failure | Auto Scaling groups; multi-AZ RDS/Aurora; Route 53 failover routing for multi-region DR
Test recovery procedures | AWS Backup with periodic restore testing; DR runbooks; scheduled DR failover drills
Scale horizontally | ECS/EKS with Fargate; ALB for load distribution
Manage change in automation | IaC-only changes; drift detection; approval gates in CI/CD
Monitor and alarm | CloudWatch alarms on key metrics; composite alarms; EventBridge rules for infrastructure state changes
Plan for disaster recovery | Tiered DR strategy (active-active for Tier 1, pilot light for Tier 3); Aurora Global Database; TGW inter-region peering
Use fault isolation boundaries | Multi-region architecture; multi-AZ within each region; separate blast radius per account

Services: Auto Scaling, ELB, Route 53 (health checks + failover routing), AWS Backup (cross-region vaults), S3 cross-region replication, Aurora Global Database, Transit Gateway inter-region peering, Direct Connect + VPN redundancy

Pillar 4: Performance Efficiency

The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes.

Best Practice | Implementation
Use serverless architectures | Lambda for event processing; Fargate for containers; Aurora Serverless for variable DB loads
Go global in minutes | CloudFront for content delivery; Route 53 latency-based routing
Experiment more often | Sandbox OU with budget caps for rapid experimentation
Use the right resource type | Compute Optimizer recommendations; rightsizing via Cost Explorer
Monitor performance | CloudWatch Container Insights for ECS/EKS; X-Ray for distributed tracing

Services: Lambda, Fargate, CloudFront, Compute Optimizer, X-Ray, CloudWatch

Pillar 5: Cost Optimization

The ability to run systems to deliver business value at the lowest price point.

Best Practice | Implementation
Implement cloud financial management | CUR + Athena + QuickSight for chargeback dashboards; dedicated FinOps team
Adopt a consumption model | Auto Scaling; Lambda pay-per-invocation; Fargate Spot
Measure overall efficiency | Cost-per-transaction metrics; cost allocation by tag (Department, CostCenter, Project)
Stop spending money on undifferentiated heavy lifting | Managed services (RDS over self-managed DB, EKS over self-managed K8s)
Analyze and attribute expenditure | Mandatory cost allocation tags; per-account budgets with anomaly detection

Services: Cost Explorer, AWS Budgets, Cost Anomaly Detection, CUR, Savings Plans, Compute Optimizer, S3 Intelligent-Tiering

Pillar 6: Sustainability

The ability to continually improve sustainability impacts by reducing energy consumption and increasing efficiency across all components.

Best Practice | Implementation
Understand your impact | AWS Customer Carbon Footprint Tool dashboard
Maximize utilization | Auto Scaling to avoid over-provisioned idle resources; Spot instances for batch
Use managed services | Shared infrastructure (Lambda, Fargate, Aurora Serverless) is more efficient than dedicated EC2
Reduce downstream impact | S3 lifecycle policies to move cold data to Glacier; delete unused EBS snapshots

Services: Customer Carbon Footprint Tool, Graviton (ARM) instances, S3 Intelligent-Tiering, Spot Instances


Centralized Logging & Access Tracking

All logs flow into the Log Archive account in the Security OU. This account has a protective SCP that denies deletion of log data.

Log Types and Sources

Log Type | AWS Service | What It Captures | Destination
API Activity | CloudTrail (Org Trail) | Every API call — who created/modified/deleted any resource, from which IP, with which role | S3 (Log Archive) + CloudWatch Logs
Resource Configuration | AWS Config | Configuration timeline of every resource — before/after snapshots | S3 (Log Archive) + Config Aggregator
Network Traffic | VPC Flow Logs | Accepted/rejected flows — source/dest IP, port, bytes, action | S3 (Log Archive) + CloudWatch Logs
DNS Queries | Route 53 Resolver Query Logging | Every DNS query from VPCs — domain, source IP, response | S3 (Log Archive)
S3 Data Access | S3 Server Access Logging + CloudTrail Data Events | Who accessed which bucket/object, when, from where | S3 (Log Archive)
SSO/Login Activity | IAM Identity Center + CloudTrail | Who logged in, which account, which permission set, MFA status | CloudTrail → S3
Load Balancer | ALB/NLB Access Logs | Client IP, latency, status codes, target group | S3 (Log Archive)
Firewall | AWS Network Firewall Logs | Allowed/denied traffic through stateful and stateless rules | S3 + CloudWatch
WAF | AWS WAF Logs | Web request inspection — blocked/allowed, rule matches | S3 / Kinesis Firehose
Database | RDS/Aurora Audit Logs | SQL queries, login attempts, schema changes | CloudWatch Logs → S3
Container | ECS/EKS + Container Insights | Application stdout/stderr, K8s audit logs, resource metrics | CloudWatch Logs → S3
Lambda | CloudWatch Logs (automatic) | Invocations, duration, errors, cold starts | CloudWatch Logs → S3
Cost Events | Cost & Usage Report (CUR) | Hourly cost per resource, with tags | S3 (Billing account)

Log Retention & Lifecycle

Tier | Retention Period | Storage Class | Purpose
Hot | 0 – 90 days | S3 Standard | Active investigation, real-time queries
Warm | 90 – 365 days | S3 Glacier Instant Retrieval | Compliance queries, incident lookback
Cold | 1 – 7 years | S3 Glacier Deep Archive | Regulatory retention (HIPAA: 6 yr, SOX: 7 yr)

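Immutability: S3 Object Lock (WORM) is enabled on all log buckets. In compliance mode, no user (including administrators and the account root user) can delete or overwrite a locked object version during the retention period; governance mode, by contrast, lets principals with the s3:BypassGovernanceRetention permission override the lock.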
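
The retention tiers above reduce to a simple age-based mapping, sketched here for use in reports or in tests of a lifecycle configuration. The function name `storage_tier` is hypothetical, and treating anything older than seven years as past retention is an assumption; the guide does not state what happens after the cold tier.

```python
from datetime import date


def storage_tier(written_on, today):
    """Map a log object's age onto the hot/warm/cold tiers."""
    age_days = (today - written_on).days
    if age_days < 0:
        raise ValueError("written_on is in the future")
    if age_days < 90:
        return "S3 Standard"
    if age_days < 365:
        return "S3 Glacier Instant Retrieval"
    if age_days < 7 * 365:
        return "S3 Glacier Deep Archive"
    # Assumption: objects older than seven years are outside the table above.
    return "past retention"


today = date(2026, 4, 1)
print(storage_tier(date(2026, 3, 1), today))
print(storage_tier(date(2025, 12, 1), today))
print(storage_tier(date(2024, 4, 1), today))
```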

Log Analysis Stack

Use Case | Tool | Description
Real-time queries | CloudWatch Logs Insights | Interactive queries on recent logs
Ad-hoc investigation | Amazon Athena | SQL queries against S3 log partitions
Security correlation | Security Hub + Amazon Security Lake | Aggregated findings with OCSF normalization
Dashboards | CloudWatch Dashboards / Managed Grafana | Operational and security dashboards
Long-term SIEM | Splunk / Datadog / Elastic (optional) | For enterprises with existing SIEM investments

Mandatory Tagging & Deployment Accountability

Tag Schema

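Every resource deployed in this environment must carry the following tags. SCPs block the creation of untagged resources for services that support tag-on-create condition keys (aws:RequestTag); the other enforcement layers catch resources that SCPs cannot cover.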

Required Tags

Tag Key | Example Value | Purpose
Department | Engineering, Finance, Security | Which team/department owns this resource
CostCenter | CC-1234 | Financial cost center for chargeback
Owner | jsmith@company.com | Individual who provisioned/owns the resource
Manager | mjones@company.com | Manager of the owner — escalation & approval audit
Environment | dev, staging, prod | Lifecycle stage
Project | ProjectAlpha | Which project this resource belongs to
DeployedBy | ci/codepipeline, jsmith-manual | How the resource was deployed

Recommended Tags

Tag Key | Example Value | Purpose
DeployPipelineId | pipeline-abc123 | Links resource to exact CI/CD pipeline execution
Application | WebApp, DataPipeline | Application name for resource grouping
DataClassification | Public, Internal, Confidential | Data sensitivity level
Compliance | HIPAA, SOC2, PCI | Applicable compliance framework
ExpirationDate | 2026-06-30 | Auto-cleanup for temporary resources

Four Layers of Tag Enforcement

Layer 1: IaC Default Tags          → Terraform provider default_tags / CFN resource tags
          │                            (applied to every resource in the pipeline)
          ▼
Layer 2: CI/CD Validation          → Pipeline step validates all required tags before deploy
          │                            (fails the build if tags are missing)
          ▼
Layer 3: SCP Enforcement           → Organization SCP denies Create* APIs without tags
          │                            (catches manual console deployments)
          ▼
Layer 4: Config Rule Detection     → AWS Config required-tags rule + auto-remediation
                                      (detects tag drift, notifies owner, quarantines if needed)
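
Layer 2 can be as small as a script that reads the JSON form of a Terraform plan (`terraform show -json plan.out`) and fails the build when a taggable resource lacks a required tag. The sketch below assumes the plan exposes `resource_changes[].change.after.tags_all`, which is how the AWS provider usually reports merged tags; values that are unknown until apply will not appear there, so treat this as a first-pass check rather than a guarantee. The function name `missing_tags` and the sample plan are illustrative.

```python
REQUIRED_TAGS = (
    "Department", "CostCenter", "Owner", "Manager",
    "Environment", "Project", "DeployedBy",
)


def missing_tags(plan):
    """Map each resource address to the required tags it lacks."""
    failures = {}
    for change in plan.get("resource_changes", []):
        after = (change.get("change") or {}).get("after") or {}
        if "tags_all" not in after and "tags" not in after:
            continue  # resource type exposes no tags, or it is being destroyed
        tags = after.get("tags_all") or after.get("tags") or {}
        missing = sorted(t for t in REQUIRED_TAGS if not tags.get(t))
        if missing:
            failures[change["address"]] = missing
    return failures


# Hypothetical, heavily trimmed plan document.
full_tags = {tag: "example" for tag in REQUIRED_TAGS}
plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"after": {"tags_all": full_tags}}},
        {"address": "aws_s3_bucket.data", "change": {"after": {
            "tags_all": {"Department": "Engineering", "Owner": "jsmith@company.com"}}}},
        {"address": "aws_iam_role_policy.x", "change": {"after": {"name": "x"}}},
    ]
}

failures = missing_tags(plan)
for address, tags in failures.items():
    print(f"{address}: missing {', '.join(tags)}")
```

In a pipeline, a non-empty result would translate into a non-zero exit code so the build stops before `terraform apply`.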

Auto-Tagging for Manual Deployments

Even with all the above, someone will eventually create a resource via the console. To catch this:

  1. EventBridge rule triggers on CloudTrail Create* / RunInstances / CreateDBInstance events
  2. Lambda function reads the CloudTrail event and auto-tags the resource with:
    • CreatedBy = IAM principal ARN from the event
    • CreatedAt = event timestamp
    • CreatedVia = console / cli / sdk / terraform (derived from user agent)
  3. If required tags are still missing after 48 hours → SNS notification to team lead + Config non-compliant finding
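
The tag-derivation step of that Lambda function might look like the following sketch. It handles only the pure mapping from a CloudTrail event (as delivered by EventBridge under `detail`) to the three audit tags; the call that actually applies the tags is omitted because it differs per service. The user-agent matching is a rough heuristic and an assumption of this example, not a documented contract, and the sample event is fabricated for illustration.

```python
def created_via(user_agent):
    """Guess the deployment channel from the CloudTrail userAgent string."""
    ua = (user_agent or "").lower()
    if "terraform" in ua:  # checked first: Terraform also reports an SDK
        return "terraform"
    if "console" in ua:
        return "console"
    if ua.startswith("aws-cli"):
        return "cli"
    return "sdk"


def audit_tags(event):
    """Build the CreatedBy / CreatedAt / CreatedVia tags from one event."""
    detail = event["detail"]
    return {
        "CreatedBy": detail["userIdentity"]["arn"],
        "CreatedAt": detail["eventTime"],
        "CreatedVia": created_via(detail.get("userAgent")),
    }


# Illustrative event, trimmed to the fields used above.
sample_event = {
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventName": "RunInstances",
        "eventTime": "2026-04-01T12:00:00Z",
        "userAgent": "aws-cli/2.15.0 Python/3.11.6",
        "userIdentity": {
            "arn": "arn:aws:sts::111122223333:assumed-role/Developer/jsmith"
        },
    },
}

tags = audit_tags(sample_event)
print(tags)
```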

Full Deployment Audit Trail

For every resource in your environment, you can answer: Who deployed it, when, how, for which project, under which cost center, approved by whom?

Git Commit (author + SHA + PR reviewer)
    → Pipeline Trigger (pipeline ID + source branch)
        → Approval Gate (who approved for staging/prod)
            → Deploy (tags: DeployedBy, DeployPipelineId, CommitSHA)
                → CloudTrail (immutable API-level audit log)
                    → AWS Config (configuration timeline with tags)

Cost Allocation, Monitoring & Chargeback

Activating Cost Allocation Tags

In the Billing console of the management account, activate these tags as cost allocation tags:

  • Department
  • CostCenter
  • Project
  • Owner
  • Environment

Note: Tags only appear in billing data after activation. Historical data before activation is not retroactively tagged. Activate on day one.

Cost & Usage Report (CUR)

Setting | Configuration
Report name | enterprise-cur
Time granularity | Hourly
Format | Apache Parquet
Compression | Parquet (columnar, efficient for Athena)
S3 bucket | Dedicated bucket in management or billing account
Integration | Athena, QuickSight, Redshift
Resource-level data | Enabled (includes individual resource IDs)
Tag columns | All activated cost allocation tags included

Budget Alerts

Budget Type | Scope | Thresholds | Action
Per-account monthly | Each linked account | 50%, 80%, 100% of budget | SNS notification to account owner + finance
Per-cost-center | Filter by CostCenter tag | 80%, 100% | SNS to cost center owner
Per-project | Filter by Project tag | 80%, 100% | SNS to project lead
Anomaly detection | Per linked account | Auto-detected anomalies | SNS + optional Lambda to stop non-prod instances
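
The threshold logic behind these alerts is plain arithmetic, shown here as a small sketch that could back a custom report or a unit test for budget definitions. It mirrors a strict greater-than comparison; the function name `crossed_thresholds` and the sample figures are illustrative.

```python
def crossed_thresholds(actual_spend, budget_limit, thresholds=(50, 80, 100)):
    """Return the percentage thresholds that actual spend has exceeded."""
    if budget_limit <= 0:
        raise ValueError("budget_limit must be positive")
    percent_used = actual_spend / budget_limit * 100
    return [t for t in thresholds if percent_used > t]


# Hypothetical account with a 1,000 USD monthly budget.
print(crossed_thresholds(850, 1000))
print(crossed_thresholds(1200, 1000))
```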

Chargeback Pipeline

Tagged Resources → CUR to S3 (hourly granularity, refreshed at least daily) → Athena (GROUP BY CostCenter, Department)
    → QuickSight Dashboard (monthly chargeback by team)
        → Automated PDF reports emailed to cost center owners

Example Athena Query — Cost by Department

SELECT
    line_item_product_code AS service,
    resource_tags_user_department AS department,
    resource_tags_user_cost_center AS cost_center,
    SUM(line_item_unblended_cost) AS total_cost
FROM cur_database.cur_table
WHERE month = '4' AND year = '2026'
GROUP BY 1, 2, 3
ORDER BY total_cost DESC
LIMIT 50;
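
Once query results are exported, the chargeback roll-up itself is a group-and-sum. The sketch below shows that step in Python, using `Decimal` to avoid floating-point drift in currency totals and bucketing untagged spend explicitly so it stays visible. The row layout (`department`, `cost_center`, `unblended_cost`) is an assumed, simplified shape, not the actual CUR schema.

```python
from collections import defaultdict
from decimal import Decimal


def chargeback(rows):
    """Sum cost per (department, cost center), largest total first."""
    totals = defaultdict(Decimal)
    for row in rows:
        key = (
            row.get("department") or "UNTAGGED",
            row.get("cost_center") or "UNTAGGED",
        )
        totals[key] += Decimal(row["unblended_cost"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Illustrative rows, not real billing data.
rows = [
    {"department": "Engineering", "cost_center": "CC-1234", "unblended_cost": "120.50"},
    {"department": "Finance", "cost_center": "CC-5678", "unblended_cost": "45.25"},
    {"department": "Engineering", "cost_center": "CC-1234", "unblended_cost": "79.50"},
    {"department": "", "cost_center": "", "unblended_cost": "10.00"},
]

report = chargeback(rows)
for (department, cost_center), total in report:
    print(f"{department:12} {cost_center:10} {total}")
```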

Services by Team

Infrastructure Team

Category | Services
Compute | EC2, ECS, EKS, Lambda, Auto Scaling Groups
Networking | VPC, Transit Gateway, Route 53, CloudFront, ALB/NLB, VPC IPAM
Storage | S3, EBS, EFS, FSx for Lustre / Windows
Database | RDS, Aurora, DynamoDB, ElastiCache, MemoryDB
Hybrid | Direct Connect, Site-to-Site VPN, AWS Outposts
Operations | Systems Manager, Patch Manager, AWS Backup, AWS Health
Monitoring | CloudWatch, X-Ray, Managed Grafana, Managed Prometheus
IaC | Terraform, CloudFormation, Service Catalog, CDK

Developer / DevOps Team

Category | Services
Source Control | GitHub / GitLab integration (verify CodeCommit's current availability for new customers before choosing it)
CI/CD | CodePipeline + CodeBuild + CodeDeploy
Container Registry | Amazon ECR
Package Management | CodeArtifact (npm, Maven, pip)
IDE | Cloud9, VS Code with Amazon Q Developer
Security Scanning | CodeGuru Reviewer, Inspector (container images), Snyk integration
Testing | CodeBuild + testing frameworks, AWS Device Farm
Infrastructure Pipeline | Terraform Cloud / Atlantis / CodePipeline for IaC

Security Team

Category | Services
Identity & Access | IAM Identity Center, AWS Organizations SCPs, Permission Boundaries
Threat Detection | GuardDuty, Security Hub, Amazon Detective, Macie
Network Protection | AWS WAF, Shield Advanced, Network Firewall
Encryption | KMS (multi-region keys), ACM, CloudHSM
Compliance | AWS Config Rules, Audit Manager, Security Lake
Incident Response | EventBridge → Step Functions → Lambda automation

CI/CD Pipeline Architecture

Application CI/CD

GitHub (webhook)
    → CodePipeline
        → Source Stage: pull code + resolve dependencies
        → Build Stage: CodeBuild
            • Docker build
            • Unit tests + integration tests
            • SAST scanning (CodeGuru Reviewer)
            • Container image scan (Inspector)
        → Artifact Stage: push to ECR / CodeArtifact
        → Deploy Dev: auto-deploy to ECS/EKS dev (blue/green)
        → Manual Approval: required for staging and prod
        → Deploy Staging: deploy + smoke tests
        → Manual Approval: prod gate
        → Deploy Prod: canary or blue/green via CodeDeploy

Infrastructure CI/CD

Git push (Terraform code)
    → CodePipeline
        → Source Stage: pull IaC repo
        → Plan Stage: CodeBuild runs `terraform plan`
            • Plan output posted as PR comment or artifact
            • Tag validation: check all resources have required tags
            • Cost estimation: Infracost or tfcost
        → Manual Approval: review plan output
        → Apply Stage: CodeBuild runs `terraform apply`
        → Drift Detection: scheduled `terraform plan` (no apply) to detect drift
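
For the scheduled drift check, `terraform plan -detailed-exitcode` is convenient because it encodes the outcome in the exit code: 0 for no changes, 1 for an error, 2 for pending changes. A minimal wrapper, assuming Terraform is on the PATH and the working directory is already initialized, could look like this; the function names and status strings are hypothetical.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in-sync", 1: "error", 2: "drift-detected"}


def classify_plan_exit(code):
    """Translate a plan exit code into a drift status."""
    return PLAN_STATUS.get(code, "error")


def run_drift_check(workdir):
    """Run a read-only plan and return (status, plan output)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode), result.stdout


# The classifier can be exercised without Terraform installed.
print(classify_plan_exit(2))
```

A scheduled job would call `run_drift_check` for each environment directory and publish any `drift-detected` result to the alerting channel.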

Cross-Account Deployment Pattern

The CI/CD account (DevTools OU) assumes roles in target workload accounts:

CI/CD Account (DevTools OU)
    │
    ├── AssumeRole → Dev Account (CodePipelineDeployRole)
    ├── AssumeRole → Staging Account (CodePipelineDeployRole)
    └── AssumeRole → Prod Account (CodePipelineDeployRole)

Each CodePipelineDeployRole has:

  • Least-privilege permissions scoped to the specific services being deployed
  • Trust policy limited to the CI/CD account
  • External ID for additional security
  • CloudTrail logging of every AssumeRole call
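
As an illustration of the trust-policy bullets, the sketch below builds the trust relationship for a deploy role: it allows `sts:AssumeRole` only from one named role in the CI/CD account and only when the caller presents the expected external ID. The account ID, role name, and external ID are placeholders, and `deploy_role_trust_policy` is a hypothetical helper.

```python
import json


def deploy_role_trust_policy(cicd_account_id, pipeline_role_name, external_id):
    """Trust policy that lets one CI/CD role assume the deploy role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{cicd_account_id}:role/{pipeline_role_name}"
                },
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values for illustration only.
trust_policy = deploy_role_trust_policy(
    cicd_account_id="111122223333",
    pipeline_role_name="CodePipelineServiceRole",
    external_id="example-external-id",
)
print(json.dumps(trust_policy, indent=2))
```

Naming a specific role as the principal, rather than the whole CI/CD account, keeps the trust relationship as narrow as the permissions attached to the role.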

Infrastructure as Code — Terraform Starter

Recommended Directory Structure

terraform-landing-zone/
├── modules/
│   ├── organization/           # AWS Organizations, OUs, SCPs, Tag Policies
│   │   ├── main.tf
│   │   ├── ous.tf
│   │   ├── scps.tf
│   │   ├── tag-policies.tf
│   │   └── variables.tf
│   ├── networking/             # Transit Gateway, VPCs, Subnets, IPAM
│   │   ├── main.tf
│   │   ├── transit-gw.tf
│   │   ├── vpc.tf
│   │   ├── network-firewall.tf
│   │   └── variables.tf
│   ├── security/               # GuardDuty, Security Hub, Inspector, Config
│   │   ├── guardduty.tf
│   │   ├── security-hub.tf
│   │   ├── config.tf
│   │   └── inspector.tf
│   ├── identity/               # IAM Identity Center, Permission Sets
│   │   ├── sso.tf
│   │   └── permission-sets.tf
│   ├── logging/                # CloudTrail org trail, S3 log archive, VPC Flow Logs
│   │   ├── cloudtrail.tf
│   │   ├── s3-log-archive.tf
│   │   ├── vpc-flow-logs.tf
│   │   └── config-recorder.tf
│   ├── governance/             # Tag policies, SCPs, Config Rules, auto-tagger Lambda
│   │   ├── tag-policy.tf
│   │   ├── scp-require-tags.tf
│   │   ├── config-rules.tf
│   │   └── auto-tagger.tf
│   ├── monitoring/             # CloudWatch alarms, dashboards, SNS topics
│   │   ├── alarms.tf
│   │   ├── dashboards.tf
│   │   └── sns-topics.tf
│   ├── cost/                   # Budgets, CUR, anomaly detection
│   │   ├── budgets.tf
│   │   ├── cur.tf
│   │   └── anomaly-detection.tf
│   └── ...
│
├── environments/
│   ├── management/             # Management account bootstrap
│   ├── security/               # Security tooling account
│   ├── network/                # Network hub account
│   ├── shared-services/        # AD, internal tools
│   ├── cicd/                   # DevTools account
│   ├── dev/                    # Workload dev
│   ├── staging/                # Workload staging
│   └── prod/                   # Workload prod
│
├── aft-config/                 # Account Factory for Terraform
│   ├── account-request/        # New account definitions
│   ├── account-customizations/ # Per-account Terraform
│   └── global-customizations/  # Applied to all new accounts
│
└── pipelines/
    ├── buildspec-plan.yml      # CodeBuild: terraform plan
    └── buildspec-apply.yml     # CodeBuild: terraform apply

Key Terraform: Provider Default Tags

# environments/{env}/main.tf
# Every resource in this environment automatically inherits these tags

provider "aws" {
  region = var.region

  default_tags {
    tags = {
      Environment      = var.environment       # "dev", "staging", "prod"
      Department       = var.department         # "Engineering"
      CostCenter       = var.cost_center        # "CC-1234"
      Owner            = var.deployer_email     # "jsmith@company.com"
      Manager          = var.manager_email      # "mjones@company.com"
      Project          = var.project_name       # "ProjectAlpha"
      DeployedBy       = "ci/terraform"
      DeployPipelineId = var.pipeline_execution_id
      ManagedBy        = "terraform"
    }
  }
}

Key Terraform: Organization & SCPs

# modules/organization/main.tf

resource "aws_organizations_organization" "org" {
  aws_service_access_principals = [
    "controltower.amazonaws.com",
    "sso.amazonaws.com",
    "config-multiaccountsetup.amazonaws.com",
    "guardduty.amazonaws.com",
    "securityhub.amazonaws.com",
    "cloudtrail.amazonaws.com",
    "tagpolicies.tag.amazonaws.com",
    "backup.amazonaws.com",
  ]
  feature_set          = "ALL"
  enabled_policy_types = ["SERVICE_CONTROL_POLICY", "TAG_POLICY"]
}

# SCP: Deny resource creation without required tags
# Keys listed under one condition operator are ANDed, so a single "Null" block
# naming every tag would deny only when ALL tags are missing. One statement per
# tag denies the launch when ANY required tag is missing.
locals {
  scp_required_tags = ["Department", "CostCenter", "Owner", "Manager", "Environment"]
}

resource "aws_organizations_policy" "require_tags" {
  name = "require-mandatory-tags"
  type = "SERVICE_CONTROL_POLICY"
  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      for tag in local.scp_required_tags : {
        Sid      = "DenyEC2Without${tag}"
        Effect   = "Deny"
        Action   = ["ec2:RunInstances"]
        Resource = ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*"]
        Condition = {
          "Null" = {
            "aws:RequestTag/${tag}" = "true"
          }
        }
      }
    ]
  })
}

Key Terraform: CloudTrail Org Trail with Immutable S3

# modules/logging/cloudtrail.tf

resource "aws_cloudtrail" "org_trail" {
  name                       = "enterprise-org-trail"
  s3_bucket_name             = aws_s3_bucket.log_archive.id
  is_organization_trail      = true
  is_multi_region_trail      = true
  enable_log_file_validation = true
  kms_key_id                 = aws_kms_key.log_encryption.arn

  cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.cloudtrail.arn}:*"
  cloud_watch_logs_role_arn  = aws_iam_role.cloudtrail_cw.arn

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3"]
    }
  }
}

# Immutable log bucket
resource "aws_s3_bucket" "log_archive" {
  bucket              = "enterprise-log-archive-${data.aws_caller_identity.current.account_id}"
  object_lock_enabled = true
}

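resource "aws_s3_bucket_lifecycle_configuration" "log_lifecycle" {
  bucket = aws_s3_bucket.log_archive.id

  rule {
    id     = "log-tiering"
    status = "Enabled"

    filter {} # empty filter: the rule applies to every object in the bucket

    # HCL has no ";" separator, so each transition is a multi-line block.
    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}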
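
Setting `object_lock_enabled = true` only turns the feature on for the bucket; objects are not protected until a retention rule applies to them. The sketch below adds a default retention rule. The COMPLIANCE mode and the 365-day period are assumptions for illustration, not values from this guide.

```hcl
# Assumption: 365 days in COMPLIANCE mode. In this mode nobody, including the
# root user, can delete or overwrite a locked object version before the
# retention period ends, so confirm the duration with your auditors first.
resource "aws_s3_bucket_object_lock_configuration" "log_archive" {
  bucket = aws_s3_bucket.log_archive.id

  rule {
    default_retention {
      mode = "COMPLIANCE"
      days = 365
    }
  }
}
```

GOVERNANCE mode is the gentler alternative while you are still testing, because principals with a specific permission can bypass it.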

Key Terraform: AWS Config — Tag Compliance

# modules/governance/config-rules.tf

resource "aws_config_config_rule" "required_tags" {
  name = "required-tags-check"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  input_parameters = jsonencode({
    tag1Key = "Department"
    tag2Key = "CostCenter"
    tag3Key = "Owner"
    tag4Key = "Manager"
    tag5Key = "Environment"
    tag6Key = "DeployedBy"
  })

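  scope {
    compliance_resource_types = [
      "AWS::EC2::Instance",
      "AWS::RDS::DBInstance",
      "AWS::S3::Bucket",
      # Verify against the required-tags rule's supported resource types;
      # types the managed rule does not support are not evaluated.
      "AWS::Lambda::Function",
      "AWS::ElasticLoadBalancingV2::LoadBalancer",
    ]
  }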
}

Key Terraform: Budget Alerts

# modules/cost/budgets.tf

resource "aws_budgets_budget" "account_monthly" {
  name         = "account-monthly-${var.account_name}"
  budget_type  = "COST"
  limit_amount = var.monthly_budget_limit
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email, var.finance_email]
  }
}

resource "aws_ce_anomaly_monitor" "account" {
  name              = "account-anomaly-${var.account_name}"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}
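
An anomaly monitor on its own only detects; it notifies nobody until a subscription is attached. A minimal sketch, reusing `var.finance_email` from the budget above. The daily frequency and the 100 USD impact threshold are assumptions to tune.

```hcl
resource "aws_ce_anomaly_subscription" "account" {
  name             = "account-anomaly-${var.account_name}"
  frequency        = "DAILY"
  monitor_arn_list = [aws_ce_anomaly_monitor.account.arn]

  subscriber {
    type    = "EMAIL"
    address = var.finance_email
  }

  # Assumption: notify only when the total impact is at least 100 USD.
  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      match_options = ["GREATER_THAN_OR_EQUAL"]
      values        = ["100"]
    }
  }
}
```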

Multi-Region Architecture & Hybrid Connectivity

This section covers the multi-region network topology with on-premises connectivity using Direct Connect (primary) and Site-to-Site VPN (backup), all routed through Transit Gateway.

Network Topology

                        On-Premises Data Center(s)
                        ┌──────────────────────────────────────────┐
                        │  Corporate Network                       │
                        │  ┌─────────────┐    ┌─────────────────┐  │
                        │  │ Customer    │    │ Customer        │  │
                        │  │ Router (DX) │    │ Router (VPN)    │  │
                        │  └──────┬──────┘    └───────┬─────────┘  │
                        └─────────┼───────────────────┼────────────┘
                                  │ Primary           │ Backup
                        ┌─────────▼─────────┐ ┌──────▼──────────┐
                        │  AWS Direct       │ │  AWS Site-to-   │
                        │  Connect          │ │  Site VPN       │
                        │  (1 or 10 Gbps)   │ │  (IPsec, ECMP)  │
                        └─────────┬─────────┘ └──────┬──────────┘
                                  │                   │
                        ┌─────────▼───────────────────▼──────────┐
                        │       Direct Connect Gateway           │
                        │       (Global — not region-specific)   │
                        └────────┬────────────────────┬──────────┘
                                 │                    │
              ┌──────────────────▼──────┐  ┌─────────▼───────────────────┐
              │  PRIMARY REGION         │  │  DR REGION                  │
              │  (e.g., us-east-1)      │  │  (e.g., us-west-2)          │
              │                         │  │                             │
              │  ┌───────────────────┐  │  │  ┌───────────────────────┐  │
              │  │ Transit Gateway   │◄─┼──┼─►│ Transit Gateway       │  │
              │  │ (Primary)         │  │  │  │ (DR)                  │  │
              │  └──┬──┬──┬──┬───────┘  │  │  └──┬──┬──┬──────────────┘  │
              │     │  │  │  │          │  │     │  │  │                 │
              │  VPCs: Prod, Dev,       │  │  VPCs: Prod, Shared,        │
              │  CI/CD, NW Firewall     │  │  NW Firewall                │
              │                         │  │                             │
              └─────────────────────────┘  └─────────────────────────────┘
                        │                            │
                        └──── TGW Inter-Region ──────┘
                              Peering

Note: the Site-to-Site VPN attaches directly to the primary Transit Gateway, not to the Direct Connect Gateway. Only the Direct Connect path traverses the DX Gateway; the diagram merges the two paths for compactness.

Direct Connect — Primary Path

| Setting | Recommendation |
|---|---|
| Connection type | Dedicated connection (1 Gbps or 10 Gbps; higher speeds are available at some locations) for production; hosted connection for smaller bandwidth needs |
| Redundancy | Two connections from different Direct Connect locations (e.g., one from EqDC2, one from CoreSite) for high availability |
| Direct Connect Gateway | Associate the DX Gateway with the Transit Gateways in both the primary and DR regions, so a single DX connection reaches both regions |
| Virtual Interface | Transit Virtual Interface (Transit VIF) → Direct Connect Gateway → Transit Gateway |
| BGP | Private ASN on-prem; advertise on-prem routes; receive the AWS CIDRs configured as allowed prefixes on the DX Gateway association |
| Encryption | MACsec (Layer 2 encryption) on 10 Gbps and faster dedicated connections at supported locations, or run an IPsec VPN over the DX connection for in-transit encryption |
| Monitoring | CloudWatch metrics: ConnectionState, ConnectionBpsEgress, ConnectionBpsIngress; alarm on state change |
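
The monitoring row above can be sketched as a CloudWatch alarm on `ConnectionState` (1 = up, 0 = down) in the `AWS/DX` namespace. The variable `var.dx_connection_id` is a hypothetical input, and the periods and missing-data handling are assumptions to tune.

```hcl
# Assumption: var.dx_connection_id holds the ID of an existing connection
# (for example "dxcon-xxxxxxxx"); this guide does not define it.
resource "aws_cloudwatch_metric_alarm" "dx_connection_down" {
  alarm_name          = "dx-connection-down"
  alarm_description   = "Direct Connect connection is down"
  namespace           = "AWS/DX"
  metric_name         = "ConnectionState"
  statistic           = "Minimum"
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 3
  treat_missing_data  = "breaching" # assumption: treat silence as an outage
  alarm_actions       = [aws_sns_topic.infra_alerts_critical.arn]

  dimensions = {
    ConnectionId = var.dx_connection_id
  }
}
```

With two redundant connections, one alarm per connection keeps a single-link failure visible even while traffic still flows over the other link.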

Site-to-Site VPN — Backup Path (Optional)

| Setting | Recommendation |
|---|---|
| Purpose | Failover path if Direct Connect goes down; also useful for initial setup while DX is being provisioned (DX can take weeks) |
| Attachment | VPN attached to the same Transit Gateway as the DX |
| ECMP | Enable ECMP on the Transit Gateway for multiple VPN tunnels to increase aggregate bandwidth (each tunnel carries up to ~1.25 Gbps) |
| Routing | BGP with lower priority (longer AS path or lower local preference) so traffic prefers DX when available |
| Encryption | IPsec: AES-256, SHA-256, DH Group 20+ |
| Accelerated VPN | Enable acceleration (backed by AWS Global Accelerator) to reduce latency and jitter over the public internet |
| Monitoring | CloudWatch metrics: TunnelState, TunnelDataIn, TunnelDataOut; alarm when tunnels go down |

Transit Gateway — Regional Hub

| Setting | Recommendation |
|---|---|
| Route tables | Segmented route tables: one for prod VPCs, one for non-prod, one for shared services; this prevents dev from reaching prod directly |
| Inter-region peering | TGW peering between the primary (us-east-1) and DR (us-west-2) regions; traffic is encrypted and runs over the AWS backbone (not the public internet) |
| Route propagation | On-prem routes propagate from the DX/VPN attachment to all TGW route tables; VPC routes propagate to the on-prem route table |
| Blackhole routes | Add blackhole routes for denied traffic (e.g., sandbox OU CIDRs should not reach on-prem) |
| Network Firewall | Inspection VPC in the Network Hub account; all east-west and egress traffic is routed through AWS Network Firewall |
| Flow Logs | TGW Flow Logs enabled → S3 (Log Archive account) for traffic analysis between all attachments |
| Sharing | Share the TGW via AWS RAM to all workload accounts in the organization |
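
Segmentation comes from which route table an attachment is associated with and which tables its routes propagate into. The sketch below shows that mechanism for one prod attachment plus a blackhole route. The resource `aws_ec2_transit_gateway_vpc_attachment.prod_app` and the variable `var.sandbox_cidr` are hypothetical names, not part of this guide.

```hcl
# Association decides which route table is consulted for traffic arriving
# from this attachment.
resource "aws_ec2_transit_gateway_route_table_association" "prod_app" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod_app.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# Propagation decides who learns a route to this attachment. The prod CIDR is
# propagated into the shared-services table only, so attachments associated
# with the non-prod table have no route to prod.
resource "aws_ec2_transit_gateway_route_table_propagation" "prod_to_shared" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod_app.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared.id
}

# Blackhole: traffic from prod attachments destined for the sandbox CIDR is
# dropped at the Transit Gateway.
resource "aws_ec2_transit_gateway_route" "blackhole_sandbox" {
  destination_cidr_block         = var.sandbox_cidr
  blackhole                      = true
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
```

Because the Transit Gateway above disables default association and propagation, every attachment needs explicit resources like these before it can route anything.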

DNS Resolution (Hybrid)

| Component | Configuration |
|---|---|
| Route 53 Private Hosted Zones | One per domain (e.g., internal.company.com), shared via RAM to all workload accounts |
| Route 53 Resolver Inbound Endpoints | In the Network Hub VPC; allows on-prem DNS servers to resolve AWS private domains |
| Route 53 Resolver Outbound Endpoints | In the Network Hub VPC; allows AWS resources to resolve on-prem DNS domains |
| Resolver Rules | Forward rules for on-prem domains (e.g., *.corp.company.com → on-prem DNS servers) shared via RAM |
| Query Logging | All DNS queries logged to S3 (Log Archive) and CloudWatch Logs |
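
The outbound endpoint and forward rule from the table could be sketched as follows. The inputs `var.hub_subnet_ids` and `var.onprem_dns_ips` and the resource `aws_security_group.resolver` are hypothetical names introduced for illustration.

```hcl
# Assumption: var.hub_subnet_ids lists at least two subnets in different AZs
# (a Resolver endpoint needs at least two IP addresses).
resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "hub-outbound"
  direction          = "OUTBOUND"
  security_group_ids = [aws_security_group.resolver.id]

  dynamic "ip_address" {
    for_each = var.hub_subnet_ids
    content {
      subnet_id = ip_address.value
    }
  }
}

# Forward queries for the on-prem domain to the on-prem DNS servers.
resource "aws_route53_resolver_rule" "corp_forward" {
  name                 = "forward-corp-company-com"
  domain_name          = "corp.company.com"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  dynamic "target_ip" {
    for_each = var.onprem_dns_ips
    content {
      ip = target_ip.value
    }
  }
}
```

The rule still has to be shared through RAM and associated with each workload VPC before those VPCs use it.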

Security & Logging for Hybrid Connectivity

All hybrid traffic adheres to the same security and logging standards as intra-AWS traffic:

| Control | Implementation |
|---|---|
| Encryption in transit | DX: MACsec or IPsec overlay; VPN: IPsec (always encrypted) |
| Network Firewall | All traffic between on-prem and VPCs passes through the Network Firewall inspection VPC |
| TGW Flow Logs | Captures all traffic crossing the Transit Gateway: source/dest, bytes, action |
| VPC Flow Logs | Per-VPC flow logs in every workload account → S3 (Log Archive) |
| CloudTrail | All networking API calls logged (CreateVpnConnection, CreateTransitGatewayPeeringAttachment, etc.) |
| DX/VPN monitoring | CloudWatch alarms on ConnectionState (DX) and TunnelState (VPN); SNS alert on failover |
| Route 53 query logs | All DNS queries logged; detect unauthorized DNS resolution attempts |
| AWS Config | Tracks changes to TGW route tables, VPN configs, security groups, NACLs |

Terraform — Direct Connect + VPN + Transit Gateway

# modules/networking/transit-gw.tf

# Primary region Transit Gateway
resource "aws_ec2_transit_gateway" "primary" {
  description                     = "Enterprise TGW - Primary Region"
  amazon_side_asn                 = 64512
  auto_accept_shared_attachments  = "disable"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  dns_support                     = "enable"
  transit_gateway_cidr_blocks     = [var.tgw_cidr]

  tags = { Name = "enterprise-tgw-primary" }
}

# Share TGW via RAM
resource "aws_ram_resource_share" "tgw_share" {
  name                      = "tgw-org-share"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "tgw" {
  resource_arn       = aws_ec2_transit_gateway.primary.arn
  resource_share_arn = aws_ram_resource_share.tgw_share.arn
}

resource "aws_ram_principal_association" "org" {
  principal          = aws_organizations_organization.org.arn
  resource_share_arn = aws_ram_resource_share.tgw_share.arn
}

# TGW Route Tables — segmented
resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.primary.id
  tags               = { Name = "tgw-rt-prod" }
}

resource "aws_ec2_transit_gateway_route_table" "non_prod" {
  transit_gateway_id = aws_ec2_transit_gateway.primary.id
  tags               = { Name = "tgw-rt-non-prod" }
}

resource "aws_ec2_transit_gateway_route_table" "shared" {
  transit_gateway_id = aws_ec2_transit_gateway.primary.id
  tags               = { Name = "tgw-rt-shared-services" }
}

# TGW Flow Logs (the AWS provider models these with the generic aws_flow_log resource)
resource "aws_flow_log" "tgw_flow" {
  transit_gateway_id       = aws_ec2_transit_gateway.primary.id
  log_destination          = aws_s3_bucket.log_archive.arn
  log_destination_type     = "s3"
  max_aggregation_interval = 60 # Transit Gateway flow logs require 60 seconds
  tags                     = { Name = "tgw-flow-logs" }
}

# modules/networking/direct-connect.tf

# Direct Connect Gateway (global resource)
resource "aws_dx_gateway" "main" {
  name            = "enterprise-dx-gateway"
  amazon_side_asn = "64513"
}

# Associate DX Gateway with Primary TGW
resource "aws_dx_gateway_association" "primary" {
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_ec2_transit_gateway.primary.id

  allowed_prefixes = var.aws_cidr_blocks  # CIDRs to advertise to on-prem
}

# Associate DX Gateway with DR TGW (multi-region)
resource "aws_dx_gateway_association" "dr" {
  provider              = aws.dr_region
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_ec2_transit_gateway.dr.id

  allowed_prefixes = var.aws_cidr_blocks_dr
}

# modules/networking/vpn-backup.tf

# Customer Gateway (on-prem router)
resource "aws_customer_gateway" "onprem" {
  bgp_asn    = var.onprem_bgp_asn    # e.g., 65000
  ip_address = var.onprem_public_ip
  type       = "ipsec.1"
  tags       = { Name = "onprem-cgw" }
}

# Site-to-Site VPN attached to Transit Gateway
resource "aws_vpn_connection" "backup" {
  customer_gateway_id = aws_customer_gateway.onprem.id
  transit_gateway_id  = aws_ec2_transit_gateway.primary.id
  type                = "ipsec.1"
  static_routes_only  = false  # Use BGP

  enable_acceleration = true   # Global Accelerator for VPN

  tunnel1_inside_cidr   = var.tunnel1_cidr
  tunnel2_inside_cidr   = var.tunnel2_cidr

  tags = { Name = "onprem-backup-vpn" }
}

# CloudWatch alarm: any VPN tunnel down
resource "aws_cloudwatch_metric_alarm" "vpn_tunnel_down" {
  alarm_name          = "vpn-tunnel-down"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 2
  metric_name         = "TunnelState"
  namespace           = "AWS/VPN"
  period              = 300
  # Minimum fires when either tunnel drops; Maximum would stay at 1 until both are down.
  statistic           = "Minimum"
  threshold           = 1
  alarm_description   = "At least one VPN tunnel is down"
  alarm_actions       = [aws_sns_topic.infra_alerts_critical.arn]

  dimensions = {
    VpnId = aws_vpn_connection.backup.id
  }
}

# modules/networking/tgw-peering.tf

# Inter-region TGW peering (primary ↔ DR)
resource "aws_ec2_transit_gateway_peering_attachment" "primary_to_dr" {
  peer_region             = var.dr_region
  peer_transit_gateway_id = aws_ec2_transit_gateway.dr.id
  transit_gateway_id      = aws_ec2_transit_gateway.primary.id

  tags = { Name = "tgw-peering-primary-to-dr" }
}

# Accept the peering in DR region
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "dr" {
  provider                      = aws.dr_region
  transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.primary_to_dr.id

  tags = { Name = "tgw-peering-dr-accept" }
}

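# Route on-prem-bound traffic from the DR region over the peering attachment.
# Peering attachments do not propagate routes, so this must be a static route.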
resource "aws_ec2_transit_gateway_route" "dr_to_onprem" {
  provider                       = aws.dr_region
  destination_cidr_block         = var.onprem_cidr
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment.primary_to_dr.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.dr_shared.id
}

Business Continuity & Disaster Recovery

DR Strategy Tiers

Not all workloads need the same DR posture. Define tiers based on criticality:

| Tier | Strategy | RPO | RTO | Workload Examples | AWS Implementation |
|---|---|---|---|---|---|
| Tier 1 (Critical) | Active-Active (Multi-Region) | Near-zero | < 5 min | Customer-facing APIs, auth services | Aurora Global Database, Route 53 failover, ECS in both regions |
| Tier 2 (Important) | Warm Standby | < 15 min | < 30 min | Internal apps, CI/CD | Scaled-down replicas in DR region, AMIs replicated, RDS read replicas |
| Tier 3 (Standard) | Pilot Light | < 1 hour | < 4 hours | Batch processing, analytics | Core infra running in DR (networking, DB replicas), compute off |
| Tier 4 (Non-Critical) | Backup & Restore | < 24 hours | < 24 hours | Dev/sandbox, archival | S3 cross-region replication, AWS Backup cross-region vaults |
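
For the Backup & Restore tier, a daily backup with a cross-region copy is one way to stay inside a 24-hour RPO. A minimal sketch follows; the vault resources `aws_backup_vault.primary` and `aws_backup_vault.dr` are hypothetical, and the schedule and 35-day retention are placeholder assumptions.

```hcl
# Assumption: aws_backup_vault.dr is created with the aws.dr_region provider.
resource "aws_backup_plan" "tier4_daily" {
  name = "tier4-daily-cross-region"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 5 * * ? *)" # once a day at 05:00 UTC

    lifecycle {
      delete_after = 35
    }

    # Copy every recovery point to the DR-region vault.
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn

      lifecycle {
        delete_after = 35
      }
    }
  }
}
```

A plan protects nothing until an `aws_backup_selection` assigns resources to it, typically by tag.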

Multi-Region Services

| Service | Multi-Region Capability |
|---|---|
| Aurora | Aurora Global Database: 1 primary region (read/write), up to 5 secondary regions (read-only, typically < 1 second replication lag); failover promotes a secondary to primary |
| DynamoDB | Global Tables: multi-region, multi-active; automatic replication |
| S3 | Cross-Region Replication (CRR): async replication with optional RTC (Replication Time Control, < 15 min SLA) |
| ECS/EKS | Deploy identical task definitions/deployments in DR region; use Route 53 for traffic steering |
| Lambda | Deploy functions in both regions; no state to replicate |
| Secrets Manager | Multi-region secrets with automatic replication |
| KMS | Multi-region keys: same key material in both regions for seamless encryption/decryption |
| Route 53 | Health checks + failover routing policies; automatic DNS failover |
| AWS Backup | Cross-region backup copies, automated via backup plans |
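
As one concrete instance of the table, a multi-Region KMS key and its DR replica might be declared like this. The resource names are illustrative, and the `aws.dr_region` provider alias is assumed to be configured as in the networking modules.

```hcl
# Primary key: multi_region must be set at creation time.
resource "aws_kms_key" "app_primary" {
  description             = "Application data key (multi-Region primary)"
  multi_region            = true
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

# Replica in the DR region shares the primary's key material, so data
# encrypted in one region can be decrypted in the other.
resource "aws_kms_replica_key" "app_dr" {
  provider                = aws.dr_region
  description             = "Application data key (DR replica)"
  primary_key_arn         = aws_kms_key.app_primary.arn
  deletion_window_in_days = 30
}
```

Key policies are not replicated; each replica carries its own policy and must grant the DR-region roles explicitly.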

Route 53 Failover Routing

# modules/dr/route53-failover.tf

resource "aws_route53_health_check" "primary_alb" {
  fqdn              = var.primary_alb_dns
  port              = 443
  type              = "HTTPS"
  request_interval  = 10
  failure_threshold = 3

  tags = { Name = "primary-region-health-check" }
}

resource "aws_route53_record" "app_primary" {
  zone_id = var.hosted_zone_id
  name    = "app.company.com"
  type    = "A"

  alias {
    name                   = var.primary_alb_dns
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary_alb.id
}

resource "aws_route53_record" "app_secondary" {
  zone_id = var.hosted_zone_id
  name    = "app.company.com"
  type    = "A"

  alias {
    name                   = var.dr_alb_dns
    zone_id                = var.dr_alb_zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
}

Alerting & Change Notifications

AWS recommends a layered alerting architecture using EventBridge as the central event bus, SNS for notification delivery, and CloudWatch Alarms for metric-based thresholds. This provides real-time visibility into changes, security events, cost anomalies, and operational issues.

Alerting Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         Event Sources                            │
│  CloudTrail │ Config │ GuardDuty │ Health │ CloudWatch │ Budgets │
└──────┬──────┴───┬────┴─────┬─────┴───┬────┴─────┬──────┴────┬────┘
       │          │          │         │          │           │
┌──────▼──────────▼──────────▼─────────▼──────────▼───────────▼────┐
│                 Amazon EventBridge (Default Bus)                 │
│                 + Custom Rules per event pattern                 │
└─────┬──────────────┬──────────────┬───────────────┬──────────────┘
      │              │              │               │
┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐  ┌──────▼───────┐
│ SNS       │  │ Lambda    │  │ SQS       │  │ AWS Chatbot  │
│ Topics    │  │ Auto-     │  │ Queue     │  │ (Slack/Teams)│
│ (email,   │  │ remediate │  │ (batch)   │  │              │
│ PagerDuty)│  │           │  │           │  │              │
└───────────┘  └───────────┘  └───────────┘  └──────────────┘

EventBridge Rules — What to Alert On

| Event Source | Event Pattern | Alert Severity | Notification Target |
|---|---|---|---|
| CloudTrail | Root user login | 🔴 Critical | Security team SNS + PagerDuty |
| CloudTrail | Console login without MFA | 🔴 Critical | Security team SNS |
| CloudTrail | IAM policy changes (PutRolePolicy, AttachRolePolicy) | 🟡 Warning | Security team SNS |
| CloudTrail | Security group changes (AuthorizeSecurityGroupIngress) | 🟡 Warning | Infra team SNS + Slack |
| CloudTrail | S3 bucket policy changes | 🟡 Warning | Security team SNS |
| CloudTrail | KMS key deletion scheduled | 🔴 Critical | Security team SNS + PagerDuty |
| GuardDuty | HIGH or CRITICAL severity finding | 🔴 Critical | Security team SNS + PagerDuty + Lambda (auto-isolate) |
| AWS Config | Non-compliant resource (missing tags) | 🟡 Warning | Tag violations SNS → team lead |
| AWS Config | Security group open to 0.0.0.0/0 | 🔴 Critical | Security team SNS + Lambda (auto-remediate) |
| AWS Health | Scheduled maintenance or service event | 🟡 Warning | Infra team SNS + Slack |
| Budgets | 80% / 100% threshold breach | 🟡 Warning / 🔴 Critical | Cost alerts SNS → finance + account owner |
| Cost Anomaly Detection | Anomaly detected | 🟡 Warning | Cost alerts SNS → finance |
| CloudWatch Alarm | EC2 StatusCheckFailed | 🔴 Critical | Infra critical SNS + PagerDuty |
| CloudWatch Alarm | RDS CPU > 90% for 10 min | 🟡 Warning | Infra warning SNS + Slack |
| DX/VPN | Connection state change (DX down, VPN tunnel down) | 🔴 Critical | Infra critical SNS + PagerDuty |
| Route 53 | Health check failure (DR failover triggered) | 🔴 Critical | Infra critical SNS + PagerDuty |
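
Each row in the table becomes one EventBridge rule plus a target. As a sketch of the "KMS key deletion scheduled" row, the rule below matches the CloudTrail record for `ScheduleKeyDeletion`; also matching `DisableKey` is an addition assumed here, not something the table requires.

```hcl
resource "aws_cloudwatch_event_rule" "kms_key_deletion" {
  name        = "detect-kms-key-deletion"
  description = "Alert when a KMS key is disabled or scheduled for deletion"

  event_pattern = jsonencode({
    source      = ["aws.kms"]
    detail-type = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["kms.amazonaws.com"]
      eventName   = ["ScheduleKeyDeletion", "DisableKey"]
    }
  })
}

# Route matches to the security topic.
resource "aws_cloudwatch_event_target" "kms_key_deletion_sns" {
  rule = aws_cloudwatch_event_rule.kms_key_deletion.name
  arn  = aws_sns_topic.security_alerts.arn
}
```

Rules based on CloudTrail API calls only fire in the region where the call is recorded, so a rule like this is usually deployed to every enabled region.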

AWS Chatbot — Slack/Teams Integration (Recommended)

AWS recommends AWS Chatbot (since renamed Amazon Q Developer in chat applications) for team-level notifications. It integrates directly with Slack and Microsoft Teams and renders CloudWatch alarms, Security Hub findings, and EventBridge events as interactive cards.

| Slack channel | Notifications routed to it |
|---|---|
| #infra-alerts | CloudWatch alarms (warning + critical), AWS Health events |
| #security-alerts | GuardDuty findings, Config non-compliance, IAM changes |
| #cost-alerts | Budget breaches, cost anomalies |
| #deploy-notifications | CodePipeline state changes (started, succeeded, failed) |

Terraform — EventBridge Rules & SNS

# modules/alerting/eventbridge-rules.tf

# Rule: Root user login
resource "aws_cloudwatch_event_rule" "root_login" {
  name        = "detect-root-login"
  description = "Alert on any root user console login"

  event_pattern = jsonencode({
    source      = ["aws.signin"]
    detail-type = ["AWS Console Sign In via CloudTrail"]
    detail = {
      userIdentity = {
        type = ["Root"]
      }
    }
  })
}

resource "aws_cloudwatch_event_target" "root_login_sns" {
  rule = aws_cloudwatch_event_rule.root_login.name
  arn  = aws_sns_topic.security_alerts.arn
}

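# Rule: security group ingress changes. The pattern matches every
# AuthorizeSecurityGroupIngress call; the Lambda target is expected to check
# for 0.0.0.0/0 before remediating.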
resource "aws_cloudwatch_event_rule" "sg_open_to_world" {
  name = "detect-sg-open-to-world"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["AWS API Call via CloudTrail"]
    detail = {
      eventName = ["AuthorizeSecurityGroupIngress"]
    }
  })
}

resource "aws_cloudwatch_event_target" "sg_lambda" {
  rule = aws_cloudwatch_event_rule.sg_open_to_world.name
  arn  = aws_lambda_function.sg_auto_remediate.arn
}

# Rule: GuardDuty HIGH/CRITICAL findings
resource "aws_cloudwatch_event_rule" "guardduty_high" {
  name = "guardduty-high-severity"

  event_pattern = jsonencode({
    source      = ["aws.guardduty"]
    detail-type = ["GuardDuty Finding"]
    detail = {
      severity = [{ numeric = [">=", 7] }]
    }
  })
}

resource "aws_cloudwatch_event_target" "guardduty_sns" {
  rule = aws_cloudwatch_event_rule.guardduty_high.name
  arn  = aws_sns_topic.security_alerts.arn
}

# Rule: CodePipeline state changes (deploy notifications)
resource "aws_cloudwatch_event_rule" "pipeline_state" {
  name = "codepipeline-state-change"

  event_pattern = jsonencode({
    source      = ["aws.codepipeline"]
    detail-type = ["CodePipeline Pipeline Execution State Change"]
    detail = {
      # STARTED is included so the channel also sees when a deployment begins.
      state = ["STARTED", "SUCCEEDED", "FAILED", "CANCELED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "pipeline_sns" {
  rule = aws_cloudwatch_event_rule.pipeline_state.name
  arn  = aws_sns_topic.deploy_notifications.arn
}

# Rule: DX connection state change
resource "aws_cloudwatch_event_rule" "dx_state_change" {
  name = "direct-connect-state-change"

  event_pattern = jsonencode({
    source      = ["aws.directconnect"]
    detail-type = ["Direct Connect Connection State Change"]
  })
}

resource "aws_cloudwatch_event_target" "dx_state_sns" {
  rule = aws_cloudwatch_event_rule.dx_state_change.name
  arn  = aws_sns_topic.infra_alerts_critical.arn
}

# modules/alerting/sns-topics.tf

resource "aws_sns_topic" "security_alerts" {
  name              = "security-alerts"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "security-alerts" }
}

resource "aws_sns_topic" "infra_alerts_critical" {
  name              = "infra-alerts-critical"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "infra-alerts-critical" }
}

resource "aws_sns_topic" "infra_alerts_warning" {
  name              = "infra-alerts-warning"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "infra-alerts-warning" }
}

resource "aws_sns_topic" "cost_alerts" {
  name              = "cost-alerts"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "cost-alerts" }
}

resource "aws_sns_topic" "deploy_notifications" {
  name              = "deploy-notifications"
  kms_master_key_id = aws_kms_key.sns_encryption.id
  tags              = { Name = "deploy-notifications" }
}

# SNS Topic Policy: allow EventBridge to publish
# Static map keys keep for_each plannable; topic ARNs are unknown until apply,
# so they cannot be used as set elements.
resource "aws_sns_topic_policy" "allow_eventbridge" {
  for_each = {
    security       = aws_sns_topic.security_alerts.arn
    infra_critical = aws_sns_topic.infra_alerts_critical.arn
    cost           = aws_sns_topic.cost_alerts.arn
    deploy         = aws_sns_topic.deploy_notifications.arn
  }

  arn = each.value
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "AllowEventBridge"
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sns:Publish"
      Resource  = each.value
    }]
  })
}
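
The `sg_lambda` target in eventbridge-rules.tf also needs a resource-based permission on the function; without it EventBridge matches the event but the invocation is rejected. A minimal sketch using the names already defined in that file:

```hcl
# Allow only the sg_open_to_world rule to invoke the remediation function.
resource "aws_lambda_permission" "allow_eventbridge_sg" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.sg_auto_remediate.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.sg_open_to_world.arn
}
```

Scoping `source_arn` to the single rule keeps other rules, or other accounts, from invoking the function through EventBridge.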

Monitoring & Alerting Built Into Deployment

Every Terraform module in this landing zone includes monitoring resources alongside the infrastructure they monitor. Monitoring is not a follow-up task — it deploys with the resource.

Monitoring-as-Code: What Gets Created With Every Deployment

| Resource Deployed | Monitoring Created Alongside |
|---|---|
| EC2 Instance | CloudWatch alarm: CPU > 85% for 5 min; StatusCheckFailed alarm; disk/memory via CloudWatch Agent |
| RDS Instance | CloudWatch alarms: CPUUtilization, FreeableMemory, DatabaseConnections, ReadLatency, ReplicaLag |
| ALB | CloudWatch alarms: TargetResponseTime > 1s, UnHealthyHostCount > 0, HTTP 5xx rate > 1% |
| ECS Service | Container Insights enabled; alarm on RunningTaskCount < DesiredTaskCount |
| Lambda Function | CloudWatch alarms: Errors > 0, Duration > 80% of timeout, Throttles > 0 |
| S3 Bucket | CloudWatch alarm: 4xxErrors rate; S3 Storage Lens enabled |
| VPC | Flow Logs enabled to S3 + CloudWatch; DNS query logging enabled |
| Any resource | AWS Config recorder running; Config rule: required-tags |
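
In practice this means the alarm lives in the same module as the resource it watches. The sketch below shows the "Errors > 0" alarm from the Lambda row; `var.function_name` and `var.alarm_topic_arn` are hypothetical module inputs.

```hcl
# Declared next to the aws_lambda_function in the same module, so the alarm
# is created and destroyed together with the function.
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "${var.function_name}-errors"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  period              = 300
  evaluation_periods  = 1
  treat_missing_data  = "notBreaching" # no invocations means no errors
  alarm_actions       = [var.alarm_topic_arn]

  dimensions = {
    FunctionName = var.function_name
  }
}
```

Passing the topic ARN in as an input lets each environment route the same alarm to a different severity topic.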

CloudWatch Dashboard — Deployed by Terraform

# modules/monitoring/dashboards.tf

resource "aws_cloudwatch_dashboard" "operational" {
  dashboard_name = "enterprise-operations-${var.environment}"
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          title   = "EC2 CPU Utilization"
          region  = var.aws_region # assumed module input; metric widgets need a region
          metrics = [["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", var.asg_name]]
          period  = 300
          stat    = "Average"
        }
      },
      {
        type = "metric"
        properties = {
          title   = "RDS Connections"
          region  = var.aws_region
          metrics = [["AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", var.db_instance]]
          period  = 300
          stat    = "Average"
        }
      },
      {
        type = "metric"
        properties = {
          title   = "ALB Response Time"
          region  = var.aws_region
          metrics = [["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", var.alb_arn_suffix]]
          period  = 60
          stat    = "p99"
        }
      }
    ]
  })
}

SNS Topics for Alert Routing

| Topic | Subscribers | Triggers |
|---|---|---|
| infra-alerts-critical | PagerDuty / OpsGenie integration | EC2 status check failed, RDS failover, ECS task crash loop |
| infra-alerts-warning | Team Slack channel + email | CPU > 85%, memory > 90%, disk > 80% |
| security-alerts | Security team email + SIEM | GuardDuty HIGH/CRITICAL, Security Hub CRITICAL |
| cost-alerts | Finance + account owner email | Budget threshold breach, anomaly detection |
| tag-violations | Team lead email | Config rule: non-compliant (missing tags) |
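
The subscribers column maps to `aws_sns_topic_subscription` resources. A minimal sketch for two of the rows; `var.pagerduty_integration_url` and `var.security_team_email` are hypothetical inputs.

```hcl
# HTTPS endpoint for the paging tool (assumed to confirm subscriptions itself).
resource "aws_sns_topic_subscription" "critical_pagerduty" {
  topic_arn              = aws_sns_topic.infra_alerts_critical.arn
  protocol               = "https"
  endpoint               = var.pagerduty_integration_url
  endpoint_auto_confirms = true
}

# Email subscriptions stay pending until the recipient clicks the
# confirmation link, which Terraform cannot do on their behalf.
resource "aws_sns_topic_subscription" "security_email" {
  topic_arn = aws_sns_topic.security_alerts.arn
  protocol  = "email"
  endpoint  = var.security_team_email
}
```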

Additional Best Practices & Considerations

Security Hardening

  • [ ] Enable MFA on all human IAM Identity Center accounts — enforce in permission set policies
  • [ ] Rotate credentials — no long-lived access keys; use IAM roles and short-lived STS tokens
  • [ ] Enable AWS Private CA if you need internal TLS certificates at scale
  • [ ] VPC endpoints for S3, DynamoDB, CloudWatch, KMS, STS — avoid sending traffic over the internet
  • [ ] IMDSv2 required — SCP blocks EC2 launches without HttpTokens = required
  • [ ] EBS default encryption — enable account-level default EBS encryption with KMS
  • [ ] S3 Block Public Access — enabled at the organization level
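
Two of the items above translate into short Terraform resources. The sketch below shows an SCP that denies instance launches unless IMDSv2 is required, and the account-level EBS encryption default. Resource names are illustrative, and the SCP still has to be attached to a target to take effect.

```hcl
resource "aws_organizations_policy" "require_imdsv2" {
  name = "require-imdsv2"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyRunInstancesWithoutIMDSv2"
      Effect   = "Deny"
      Action   = "ec2:RunInstances"
      Resource = "arn:aws:ec2:*:*:instance/*"
      Condition = {
        StringNotEquals = { "ec2:MetadataHttpTokens" = "required" }
      }
    }]
  })
}

# This setting is per account and per region, so it belongs in the baseline
# applied to every account/region pair.
resource "aws_ebs_encryption_by_default" "this" {
  enabled = true
}
```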

Networking

  • [ ] Use AWS VPC IPAM for centralized CIDR management — prevents overlaps
  • [ ] DNS resolution — Route 53 Private Hosted Zones shared via RAM; Resolver rules for on-prem
  • [ ] Egress inspection — AWS Network Firewall in the network hub for all outbound traffic
  • [ ] No public subnets in workload accounts (except for ALBs) — use NAT Gateways or centralized egress

Operational Readiness

  • [ ] AWS Trusted Advisor — enable organizational view; remediate HIGH findings
  • [ ] AWS Health — enable organizational Health events; EventBridge rules for automated response
  • [ ] Patch management — Systems Manager Patch Manager with maintenance windows
  • [ ] Golden AMI pipeline — EC2 Image Builder → test → approve → share via AWS RAM
  • [ ] Backup strategy — AWS Backup with organization-wide backup policies; periodic restore tests
  • [ ] Disaster recovery: define RPO/RTO per workload tier and implement the matching strategy (active-active, warm standby, pilot light, or backup and restore)

Compliance & Audit

  • [ ] AWS Audit Manager — continuous evidence collection for SOC 2, HIPAA, PCI
  • [ ] AWS Artifact — download AWS compliance reports (SOC, ISO, PCI)
  • [ ] Well-Architected Tool — schedule quarterly Well-Architected Reviews per workload
  • [ ] Config Conformance Packs — deploy pre-built rule sets for specific compliance frameworks

Developer Experience

  • [ ] Service Catalog — pre-approved resource templates so developers don't need to know Terraform
  • [ ] Sandbox accounts — low-friction experimentation with budget caps and auto-cleanup (aws-nuke on schedule)
  • [ ] Amazon Q Developer — AI-assisted coding and cloud operations
  • [ ] Self-service account vending — AFT or CfCT for teams to request new accounts via PR

Cost Optimization

  • [ ] Savings Plans — Compute Savings Plans for predictable EC2/Fargate/Lambda usage
  • [ ] Reserved Instances — for stable RDS and ElastiCache workloads
  • [ ] Spot Instances — for batch processing, CI/CD build agents, and fault-tolerant workloads
  • [ ] S3 Intelligent-Tiering — for buckets with unpredictable access patterns
  • [ ] Right-sizing — AWS Compute Optimizer recommendations reviewed monthly
  • [ ] Unused resource cleanup — Lambda function scans for unattached EBS volumes, idle EC2, unused EIPs

Phased Rollout Plan

Phase 1: Foundation (Week 1–2)

| Task | Services |
|---|---|
| Enable AWS Organizations + Control Tower | Organizations, Control Tower, IAM Identity Center |
| Create core OUs and accounts (Security, Infrastructure) | Log Archive, Security Tooling, Audit, Network Hub |
| Set up IAM Identity Center with corporate IdP | IAM Identity Center, SAML federation |
| Apply baseline SCPs (deny root, restrict regions, IMDSv2) | Organizations SCPs |
| Enable org-wide CloudTrail to Log Archive | CloudTrail, S3, KMS |
| Enable AWS Config with aggregator | AWS Config |
| Activate cost allocation tags | Billing, Tag Policies |

Phase 2: Networking & Security (Week 2–3)

| Task | Services |
|---|---|
| Deploy Transit Gateway in Network Hub | Transit Gateway, RAM |
| Configure VPC IPAM for CIDR management | VPC IPAM |
| Deploy Network Firewall for egress inspection | Network Firewall |
| Set up Route 53 Private Hosted Zones + Resolver | Route 53 |
| Provision Direct Connect (primary) + Site-to-Site VPN (backup) | Direct Connect, VPN |
| Configure Transit Gateway route tables (prod, non-prod, shared) | Transit Gateway |
| Enable GuardDuty (delegated admin in Security Tooling) | GuardDuty |
| Enable Security Hub with aggregation | Security Hub |
| Deploy Config Rules + tag compliance rules | AWS Config |
| Set up CUR + Budgets + Anomaly Detection | Billing, CUR, Budgets |

Phase 3: DevTools & CI/CD (Week 3–4)

| Task | Services |
|---|---|
| Provision CI/CD account in DevTools OU | Control Tower Account Factory |
| Build application CI/CD pipeline | CodePipeline, CodeBuild, CodeDeploy |
| Build infrastructure CI/CD pipeline | CodePipeline, CodeBuild, Terraform |
| Set up ECR and CodeArtifact | ECR, CodeArtifact |
| Create cross-account deploy roles in workload accounts | IAM |
| Deploy tag validation step in pipelines | CodeBuild |
| Deploy monitoring-as-code modules | CloudWatch, SNS |
| Configure EventBridge alerting rules (root login, SG changes, GuardDuty) | EventBridge, SNS |
| Set up AWS Chatbot for Slack/Teams notifications | AWS Chatbot |

Phase 4: Workloads (Week 4–5)

| Task | Services |
|---|---|
| Provision workload accounts (Dev, Staging, Prod) | AFT or CfCT |
| Deploy VPCs via IaC into each workload account | VPC, Subnets, TGW attachments |
| Provision developer sandbox accounts | Sandbox OU with SCPs + budget caps |
| Run first Well-Architected Review | Well-Architected Tool |

Phase 5: Multi-Region & DR (Week 5–6)

| Task | Services |
|---|---|
| Deploy Transit Gateway in DR region + inter-region peering | Transit Gateway |
| Configure Direct Connect Gateway association with DR TGW | Direct Connect |
| Deploy Aurora Global Database for Tier 1 workloads | Aurora |
| Set up S3 cross-region replication for critical buckets | S3 CRR |
| Configure Route 53 failover routing + health checks | Route 53 |
| Set up AWS Backup cross-region vault copies | AWS Backup |
| Conduct DR failover test (GameDay) | Fault Injection Service |

Phase 6: Optimization & Hardening (Week 6–7)

| Task | Services |
|---|---|
| Enable Compute Optimizer and right-sizing recommendations | Compute Optimizer |
| Purchase Savings Plans / Reserved Instances | Cost Management |
| Deploy Audit Manager frameworks | Audit Manager |
| Set up automated backup with restore testing | AWS Backup |
| GameDay / chaos engineering exercise | Fault Injection Service |
| QuickSight chargeback dashboards | QuickSight, Athena, CUR |

Conclusion

Building an enterprise AWS landing zone is not a weekend project — it's a deliberate, multi-phase effort that establishes the foundation for everything that follows. The key principles:

  1. Multi-account isolation — separate workloads, environments, and functions into distinct accounts
  2. Governance as code — SCPs, tag policies, and Config rules managed in Terraform, not the console
  3. Logging everything — CloudTrail, Config, VPC Flow Logs, DNS, and access logs flowing to an immutable Log Archive
  4. Tag everything — mandatory tags enforced at four layers: IaC defaults, CI/CD validation, SCPs, and Config rules
  5. Monitor at deploy time — CloudWatch alarms, dashboards, and SNS topics created alongside every resource
  6. CI/CD for everything — applications and infrastructure deploy through pipelines with approval gates and audit trails
  7. Multi-region resilience — DR region with Transit Gateway peering, Aurora Global, Route 53 failover, and tiered RPO/RTO
  8. Hybrid connectivity — Direct Connect (primary) + VPN (backup) reaching both regions via a single DX Gateway
  9. Proactive alerting — EventBridge + SNS + AWS Chatbot for real-time notification of security events, infrastructure changes, and cost anomalies
  10. Well-Architected by design — every component maps to one or more of the six pillars

Start with Control Tower and Account Factory for Terraform (AFT). Build the foundation right, and every team (infrastructure, developers, security) has a governed, observable, cost-transparent environment to operate in.


References: