How to automatically detect and clean up all AWS resources across multiple student accounts in Control Tower using Terraform?

I work in a cloud lab rental company that provides temporary AWS environments for students to practice and learn. We use AWS Control Tower for multi-account management with the following architecture:

  • AWS Control Tower with Management Account
  • Sandbox OU containing N student accounts (lab1, lab2, ...)
  • Each student account is assigned to one IAM Identity Center user
  • Students have AdministratorAccess permission set within their isolated account
  • Each student can only see and access their own account (isolated environment)

After a student's lab session ends, we need to manually trigger a cleanup process that detects and deletes ALL AWS resources they created to avoid ongoing charges. Students may use various services (EC2, S3, RDS, Lambda, DynamoDB, SageMaker, etc.).

Requirements

  1. Comprehensive service coverage: Scan and delete resources from 25+ AWS services
  2. Exclude AWS defaults: Skip default VPCs, security groups, route tables, and Control Tower-managed resources
  3. Terraform-based: Use Infrastructure as Code for consistency and version control
  4. Isolated execution: Clean each student account independently without affecting others
  5. Manual trigger: Cleanup initiated on-demand, not automatically scheduled

Current Approach:

I've developed a Python + Terraform solution:

  1. Python detection script (detect_resources.py):
  • Uses boto3 with AWS CLI SSO profiles
  • Detects resources from EC2, S3, RDS, Lambda, DynamoDB, VPC components, etc.
  • Filters out AWS default resources (default VPC, security groups, Control Tower stacks)
  • Outputs detected_resources.json with all findings
  2. Terraform cleanup orchestration:
  • Reads detected_resources.json
  • Uses null_resource with local-exec to invoke Python deletion script
  • Handles resource dependencies (e.g., deletes NAT gateways before VPCs)
  • Generates cleanup reports
  3. Python deletion script (delete_resources.py):
  • Deletes resources in dependency order
  • Handles API rate limits and eventual consistency
  • Provides progress reporting

Questions

  1. Is this architecture approach recommended for multi-account cleanup in Control Tower environments?

  2. Are there native AWS services or features better suited for this use case?

    • AWS Config Rules with remediation actions?
    • AWS Systems Manager Automation?
    • Service Catalog with lifecycle management?
    • aws-nuke?
  3. How can I ensure student accounts are truly "clean" and ready for the next user?

    • Are there hidden resources or costs I should watch for?
    • How to handle resources with deletion protection enabled?
    • What about resources in "deleting" state that may fail?
  4. Performance considerations:

    • Scanning 70 accounts across multiple services
    • How to optimize for faster cleanup?
    • Best practices for handling API throttling at scale?
  5. Security and compliance:

    • How to maintain audit trails of what was deleted?
    • CloudTrail log retention recommendations for cleanup operations?

Any guidance on best practices or AWS-recommended approaches for this use case would be greatly appreciated.

2 Answers

Your Python + Terraform approach is functional, but there are several considerations and alternative approaches that may better suit your multi-account cleanup requirements in a Control Tower environment.

Architecture Assessment

Your current approach of using Python for detection and deletion with Terraform orchestration works, but mixing imperative scripts with declarative infrastructure-as-code creates complexity. Terraform excels at managing known infrastructure states, not discovering and destroying unknown resources created by students.

Recommended Alternatives

aws-nuke is an open-source, community-maintained tool (not an AWS service) built for exactly this use case: comprehensive resource cleanup across AWS accounts. It can scan and delete resources from 100+ AWS services, supports filter rules that can protect Control Tower resources, and works through dependencies by retrying deletions until they succeed. However, it's not Terraform-based, which may conflict with your IaC requirement.

AWS Config with Remediation can identify non-compliant resources and trigger automated cleanup, but requires significant upfront configuration for each resource type and may not cover all 25+ services you need.

AWS Systems Manager Automation can orchestrate cleanup workflows across accounts, but you'd still need to write the detection and deletion logic yourself.

Best Practices for Your Approach

If continuing with your Python + Terraform solution:

1. Cross-Account Access Pattern: Use IAM roles assumed through STS AssumeRole instead of SSO profiles for programmatic access. Configure your Python scripts to assume a cleanup role in each student account from your management account or, preferably, a dedicated automation account. This supports least privilege and gives a clearer audit trail, because every cleanup call is recorded under a named role session.
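
A minimal sketch of that pattern, assuming a role named LabCleanupRole exists in every student account and trusts the automation account (the role name and the helper names are hypothetical, not something Control Tower creates for you). The STS client is passed in so the helper can be exercised without AWS access:

```python
def cleanup_role_arn(account_id, role_name="LabCleanupRole"):
    """Build the ARN of the (hypothetical) cleanup role in a student account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_cleanup_role(sts_client, account_id, session_label,
                        role_name="LabCleanupRole"):
    """Assume the cleanup role and return keyword arguments for boto3.Session.

    sts_client is a boto3 STS client created in the automation account.
    session_label ends up in CloudTrail as the role session name.
    """
    response = sts_client.assume_role(
        RoleArn=cleanup_role_arn(account_id, role_name),
        RoleSessionName=session_label[:64],  # STS caps session names at 64 chars
        DurationSeconds=3600,
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

Typical use would be boto3.Session(**assume_cleanup_role(boto3.client("sts"), "123456789012", "cleanup-lab1-session42")), after which every client created from that session acts inside the student account.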

2. Scope Permissions Carefully: The cross-account IAM role should have only the minimum permissions required for resource discovery and deletion. Avoid granting broad administrative access to the automation role.

3. Protect Terraform State: Since you're managing resources across multiple accounts, treat your Terraform state as highly sensitive. Use an encrypted S3 backend with DynamoDB state locking and restrict access appropriately.

4. Handle Resource Dependencies: Your approach of ordering deletions (NAT gateways before VPCs) is correct. Expand this to cover:

  • EC2 instances before their security groups
  • RDS instances before DB subnet groups
  • VPC-attached Lambda functions before the subnets and security groups they use
  • ECS services before task definitions and clusters
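
One way to keep that ordering maintainable is to declare the dependencies as data and let a topological sort produce the deletion order. The sketch below uses Python's standard graphlib module (Python 3.9+); the resource-type labels and the DELETE_AFTER table are illustrative assumptions, not AWS identifiers, and would need to match whatever detected_resources.json actually contains:

```python
from graphlib import TopologicalSorter

# Each key must be deleted AFTER every type listed in its value set.
# Labels are hypothetical; extend the table as you add services.
DELETE_AFTER = {
    "security_group": {"ec2_instance", "lambda_function", "rds_instance"},
    "db_subnet_group": {"rds_instance"},
    "ecs_cluster": {"ecs_service"},
    "task_definition": {"ecs_service"},
    "subnet": {"ec2_instance", "nat_gateway", "lambda_function", "db_subnet_group"},
    "vpc": {"subnet", "security_group", "nat_gateway"},
}


def deletion_order(found_types):
    """Return the detected resource types in a safe deletion order."""
    found = set(found_types)
    # Only keep dependencies on types that were actually detected.
    graph = {t: DELETE_AFTER.get(t, set()) & found for t in found}
    # static_order() yields prerequisites first, i.e. what to delete first.
    return list(TopologicalSorter(graph).static_order())
```

Types with no ordering constraint between them can come out in any relative order, so the result should be treated as one valid order rather than the only one.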

5. Hidden Resources to Watch

  • EBS snapshots and AMIs created from student instances
  • CloudWatch Logs groups that persist after Lambda/ECS deletion
  • Elastic IPs not associated with instances (still incur charges)
  • EBS volumes in "available" state
  • S3 bucket versioning - delete all versions, not just current objects
  • RDS automated backups and manual snapshots
  • ECR repositories with container images
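  • Secrets Manager secrets (deletion is scheduled with a 7- to 30-day recovery window unless you force deletion without recovery)
  • KMS keys (customer managed keys can only be scheduled for deletion, with a 7- to 30-day waiting period)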
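
The S3 versioning item is the one most often missed, because a versioned bucket cannot be deleted while any version or delete marker remains. A sketch of purging both, assuming s3_client is a boto3 S3 client for the student account (the function name is made up for illustration):

```python
def purge_bucket_versions(s3_client, bucket):
    """Delete every object version and delete marker in a bucket.

    Returns the number of entries S3 accepted for deletion.
    """
    removed = 0
    paginator = s3_client.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket):
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        targets = [{"Key": e["Key"], "VersionId": e["VersionId"]} for e in entries]
        # DeleteObjects accepts at most 1000 keys per request.
        for start in range(0, len(targets), 1000):
            batch = targets[start:start + 1000]
            response = s3_client.delete_objects(
                Bucket=bucket, Delete={"Objects": batch, "Quiet": True}
            )
            # In quiet mode the response lists only the keys that failed.
            removed += len(batch) - len(response.get("Errors", []))
    return removed
```

Buckets that use Object Lock or have a restrictive bucket policy may still refuse these deletes, so a real script would need to report the returned errors rather than just count them.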

6. Deletion Protection Handling: Before cleanup, your script should:

  • Disable termination protection on EC2 instances
  • Disable deletion protection on RDS instances and load balancers, and termination protection on CloudFormation stacks
  • Delete all object versions and delete markers from versioned S3 buckets before deleting the buckets
  • Check for and handle resources with retention policies
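
As a rough illustration of the first two items, the sketch below switches off the protection flags through the corresponding boto3 calls. It assumes clients is a dict of boto3 clients for one account and region, and that resources uses the hypothetical keys shown; neither structure comes from the original scripts:

```python
def disable_protections(clients, resources):
    """Turn off termination/deletion protection; return the number of API calls.

    clients: dict with boto3 clients under "ec2", "rds", "elbv2", "cloudformation".
    resources: dict of identifier lists (key names are illustrative).
    """
    calls = 0
    for instance_id in resources.get("ec2_instance_ids", []):
        clients["ec2"].modify_instance_attribute(
            InstanceId=instance_id, DisableApiTermination={"Value": False}
        )
        calls += 1
    for db_id in resources.get("rds_instance_ids", []):
        clients["rds"].modify_db_instance(
            DBInstanceIdentifier=db_id, DeletionProtection=False, ApplyImmediately=True
        )
        calls += 1
    for lb_arn in resources.get("load_balancer_arns", []):
        clients["elbv2"].modify_load_balancer_attributes(
            LoadBalancerArn=lb_arn,
            Attributes=[{"Key": "deletion_protection.enabled", "Value": "false"}],
        )
        calls += 1
    for stack_name in resources.get("stack_names", []):
        clients["cloudformation"].update_termination_protection(
            StackName=stack_name, EnableTerminationProtection=False
        )
        calls += 1
    return calls
```

Aurora clusters and other services carry their own protection flags, so this list is a starting point rather than complete coverage.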

7. Performance Optimization (for 70 accounts):

  • Implement parallel processing with thread pools or async operations
  • Use pagination for all AWS API calls to handle large result sets
  • Implement exponential backoff for API throttling
  • Consider regional parallelization - scan all regions simultaneously per account
  • Cache IAM role credentials to reduce STS API calls
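
The account and region parallelism can be sketched with a thread pool from the standard library. Here scan_fn stands in for whatever per-region detection function your script already has (its name and signature are assumptions), and failures are collected per pair instead of aborting the whole run:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_all(account_ids, regions, scan_fn, max_workers=16):
    """Run scan_fn(account_id, region) for every pair in parallel.

    Returns (results, errors), both keyed by (account_id, region).
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(scan_fn, account_id, region): (account_id, region)
            for account_id in account_ids
            for region in regions
        }
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:  # keep going; record the failure for the report
                errors[key] = repr(exc)
    return results, errors
```

Because boto3 sessions are not thread-safe, scan_fn should create its own session or clients inside the worker rather than share one across threads. The max_workers value is a guess to tune against the throttling you observe.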

8. API Throttling at Scale

  • Implement retry logic with exponential backoff and jitter
  • Use AWS SDK built-in retry mechanisms
  • Consider spreading cleanup operations over time if not time-critical
  • Monitor CloudWatch metrics for API throttling events
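
For boto3 itself, the built-in mechanism is configured by passing botocore.config.Config(retries={"max_attempts": 10, "mode": "adaptive"}) when creating a client. Where you wrap calls yourself, exponential backoff with full jitter looks roughly like the sketch below; the function name, defaults, and the is_throttle callback are illustrative assumptions:

```python
import random
import time


def call_with_backoff(fn, is_throttle, max_attempts=6, base=0.5, cap=20.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying throttling errors with capped exponential backoff.

    is_throttle(exc) decides whether an exception is worth retrying.
    sleep and rng are injectable so the behaviour can be tested quickly.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if not is_throttle(exc) or last_attempt:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            sleep(rng() * min(cap, base * 2 ** attempt))
```

With botocore's ClientError, is_throttle could inspect exc.response["Error"]["Code"] for values such as "Throttling", "ThrottlingException", or "RequestLimitExceeded"; which codes a given service returns is something to confirm per service.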

9. Audit Trail Requirements

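  • Confirm that CloudTrail covers every student account and delivers logs to a central S3 bucket (a Control Tower landing zone normally sets this up in the Log Archive account)
  • Label all cleanup operations with identifiers (student ID, session ID, cleanup timestamp), for example in the role session name and in the cleanup report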
  • Generate detailed cleanup reports including: resources found, resources deleted, resources that failed deletion, and reasons for failures
  • Retain CloudTrail logs for at least 90 days for compliance and troubleshooting
  • Consider using CloudWatch Logs Insights for querying cleanup operations
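
The cleanup report described above could be as simple as one JSON document per account and session. This sketch assumes resources are tracked by string identifiers; the field names are invented for illustration:

```python
from datetime import datetime, timezone


def build_cleanup_report(account_id, session_id, found, deleted, failed, now=None):
    """Summarise one cleanup run.

    found, deleted: iterables of resource identifiers.
    failed: dict mapping resource identifier -> failure reason.
    """
    now = now or datetime.now(timezone.utc)
    found_set, deleted_set = set(found), set(deleted)
    return {
        "account_id": account_id,
        "session_id": session_id,
        "cleanup_timestamp": now.isoformat(),
        "resources_found": sorted(found_set),
        "resources_deleted": sorted(deleted_set),
        "resources_failed": [
            {"resource": r, "reason": failed[r]} for r in sorted(failed)
        ],
        # Found but neither deleted nor reported as failed: needs a second look.
        "unaccounted": sorted(found_set - deleted_set - set(failed)),
        "clean": not failed and found_set <= deleted_set,
    }
```

Serialising the result with json.dumps and storing it next to the CloudTrail logs gives a human-readable record that can be matched to the role session name used for the run.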

10. Verification of Clean State: After cleanup, run a verification scan and confirm that:

  • No resources remain except AWS defaults and Control Tower managed resources
  • All regions are checked (students may create resources in non-primary regions)
  • Billing alerts are configured to catch any missed resources
  • Account costs drop to near-zero in the AWS Cost Explorer API (cost data can lag by about a day)
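
The first check can reuse the detection output: filter out what is allowed to stay and treat anything left as a failed cleanup. In this sketch the resource dict keys (type, name, is_default, is_main) and the Control Tower name prefixes are assumptions to adapt to your own scan format and landing zone:

```python
# Name prefixes assumed to mark Control Tower managed resources; verify
# against what your landing zone actually deploys before relying on them.
CONTROL_TOWER_PREFIXES = ("aws-controltower", "AWSControlTower", "StackSet-AWSControlTower")


def is_protected(resource):
    """Return True for AWS defaults and Control Tower managed resources."""
    if resource.get("is_default"):
        return True
    if resource["type"] == "security_group" and resource.get("name") == "default":
        return True
    if resource["type"] == "route_table" and resource.get("is_main"):
        return True
    return str(resource.get("name", "")).startswith(CONTROL_TOWER_PREFIXES)


def leftovers(scan_results):
    """scan_results: dict region -> list of resource dicts.

    Returns only regions that still contain unexpected resources.
    """
    remaining = {}
    for region, resources in scan_results.items():
        unexpected = [r for r in resources if not is_protected(r)]
        if unexpected:
            remaining[region] = unexpected
    return remaining
```

An empty result from leftovers() across all enabled regions is the signal that the account can be handed to the next student.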

11. Failed Deletion Handling

  • Implement retry logic for resources stuck in "deleting" state
  • Set timeouts for deletion operations (e.g., 30 minutes per resource type)
  • Create a "quarantine" list of accounts with failed cleanups for manual review
  • Send notifications when cleanup fails or times out
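
A sketch of the timeout and quarantine logic, assuming still_exists is a zero-argument callable that re-checks one resource (for example by describing it and catching the not-found error). The function names and the outcomes structure are hypothetical:

```python
import time


def wait_until_deleted(still_exists, timeout_s=1800, poll_s=15,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until still_exists() returns False; give up after timeout_s seconds.

    Returns True if the resource disappeared, False on timeout.
    """
    deadline = clock() + timeout_s
    while still_exists():
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True


def quarantine_accounts(outcomes):
    """outcomes: dict account_id -> dict resource_id -> bool (True = deleted).

    Returns the sorted account IDs that need manual review.
    """
    return sorted(
        account_id
        for account_id, resources in outcomes.items()
        if not all(resources.values())
    )
```

The 30-minute default mirrors the example timeout above; slow deletions such as RDS instances or CloudFront distributions may need a longer value.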

Alternative Architecture Consideration

Instead of relying only on detecting and deleting resources, consider using AWS Organizations Service Control Policies (SCPs) to restrict which services, resource sizes, and regions students can use in the first place, combined with AWS Service Catalog to provide pre-approved infrastructure patterns that can be terminated as a unit. This shifts from reactive cleanup to proactive control.

For your specific use case with temporary lab environments, the most robust solution would be:

  1. Use AWS Organizations to create fresh accounts on-demand
  2. Assign them to students for the lab duration
  3. Close the accounts after the session ends (closed accounts stay suspended for 90 days before permanent deletion)

This largely eliminates residual costs and gives complete isolation, but AWS Organizations limits how many member accounts you can close within a rolling 30-day period, so check that quota against your lab turnover.

However, if account creation/closure overhead is too high, your Python + Terraform approach with the enhancements above will work, though it requires ongoing maintenance as AWS releases new services.

answered 2 months ago
AWS
EXPERT
reviewed 2 months ago

I personally use AWS Nuke. However, note that it can only delete resources whose types are already implemented in the project, so brand-new AWS services and features are not always supported immediately. See https://github.com/gruntwork-io/cloud-nuke (this link points to Gruntwork's cloud-nuke, a similar tool that is a separate project from aws-nuke).

EXPERT
answered 2 months ago