Your Python + Terraform approach is functional, but there are several considerations and alternative approaches that may better suit your multi-account cleanup requirements in a Control Tower environment.
Architecture Assessment
Your current approach of using Python for detection and deletion with Terraform orchestration works, but mixing imperative scripts with declarative infrastructure-as-code creates complexity. Terraform excels at managing known infrastructure states, not discovering and destroying unknown resources created by students.
Recommended Alternatives
AWS Nuke (aws-nuke, an open-source tool rather than an AWS service) is designed for exactly this use case: comprehensive resource cleanup across AWS accounts. It can scan and delete resources from 100+ AWS services, supports filter rules to protect Control Tower resources, and copes with dependencies by retrying deletions that fail on the first pass. However, it's not Terraform-based, which may conflict with your IaC requirement.
AWS Config with Remediation can identify non-compliant resources and trigger automated cleanup, but requires significant upfront configuration for each resource type and may not cover all 25+ services you need.
AWS Systems Manager Automation can orchestrate cleanup workflows across accounts, but you'd still need to write the detection and deletion logic yourself.
Best Practices for Your Approach
If continuing with your Python + Terraform solution:
1. Cross-Account Access Pattern
Use IAM roles with STS assume_role instead of SSO profiles for programmatic access. Configure your Python scripts to assume a cleanup role in each student account from your management account or from a dedicated automation account. This follows the principle of least privilege and provides better audit trails.
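As a rough illustration of the assume-role pattern, the sketch below takes an STS client and returns temporary credentials for one student account. The role name `LabCleanupRole` and the session name are hypothetical placeholders, not names from your setup, and the role is assumed to already exist in each account with a trust policy for the automation account.

```python
from typing import Any, Dict


def assume_cleanup_role(
    sts_client: Any,
    account_id: str,
    role_name: str = "LabCleanupRole",   # hypothetical role name
    session_name: str = "lab-cleanup",   # hypothetical session name
) -> Dict[str, str]:
    """Assume the cleanup role in one account and return session keyword arguments."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=3600,
    )
    creds = response["Credentials"]
    # Keys match the keyword arguments accepted by boto3.Session(...).
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

With boto3 this could be used roughly as `boto3.Session(**assume_cleanup_role(boto3.client("sts"), "111122223333"))`. Passing the client in keeps the function easy to exercise with a stub.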
2. Scope Permissions Carefully
The cross-account IAM role should have only the minimum permissions required for resource discovery and deletion. Avoid granting broad administrative access to the automation role.
3. Protect Terraform State
Since you're managing resources across multiple accounts, treat your Terraform state as highly sensitive. Use an encrypted S3 backend with DynamoDB locking, and restrict access to the state bucket and lock table to the automation role.
4. Handle Resource Dependencies
Your approach of ordering deletions (NAT gateways before VPCs) is correct. Expand this to cover:
- EC2 instances before their security groups
- RDS instances before DB subnet groups
- Lambda functions before the subnets and security groups referenced by their VPC configurations
- ECS services before task definitions and clusters
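One way to keep this ordering maintainable is to declare the "delete X before Y" pairs as data and let a topological sort produce the sequence. The sketch below uses the standard-library `graphlib` module (Python 3.9+). The resource type names and the exact pairs are illustrative assumptions, not a complete dependency map.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each key must be deleted BEFORE every resource type in its value set.
# Illustrative subset only; extend it for the services your labs use.
DELETE_BEFORE: Dict[str, Set[str]] = {
    "nat_gateway": {"vpc"},
    "ec2_instance": {"security_group"},
    "rds_instance": {"db_subnet_group"},
    "lambda_function": {"security_group", "subnet"},
    "ecs_service": {"ecs_task_definition", "ecs_cluster"},
    "security_group": {"vpc"},
    "subnet": {"vpc"},
}


def deletion_order(delete_before: Dict[str, Set[str]]) -> List[str]:
    """Return resource types in an order that respects every 'before' rule."""
    sorter: TopologicalSorter = TopologicalSorter()
    for first, laters in delete_before.items():
        for later in laters:
            # graphlib emits predecessors first, so `first` precedes `later`.
            sorter.add(later, first)
    # Raises graphlib.CycleError if the rules contradict each other.
    return list(sorter.static_order())


if __name__ == "__main__":
    print(deletion_order(DELETE_BEFORE))
```

A contradictory rule set fails fast with `CycleError` instead of surfacing as a confusing `DependencyViolation` halfway through a cleanup run.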
5. Hidden Resources to Watch
- EBS snapshots and AMIs created from student instances
- CloudWatch Logs groups that persist after Lambda/ECS deletion
- Elastic IPs not associated with instances (still incur charges)
- EBS volumes in "available" state
- Versioned S3 buckets: delete all object versions and delete markers, not just current objects
- RDS automated backups and manual snapshots
- ECR repositories with container images
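- Secrets Manager secrets (deletion defaults to a 30-day recovery window, 7 days minimum, unless you force deletion without recovery)
- KMS keys (deletion is scheduled with a 7 to 30 day waiting period, not immediate)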
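Versioned buckets are a common source of leftovers, so here is a minimal sketch of emptying one. It assumes a boto3 S3 client is passed in and relies on the `list_object_versions` paginator and the `delete_objects` call (at most 1000 keys per request). The function names are hypothetical.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_version_batches(
    s3_client: Any, bucket: str, batch_size: int = 1000
) -> Iterator[List[Dict[str, str]]]:
    """Yield batches of {Key, VersionId} covering versions and delete markers."""
    paginator = s3_client.get_paginator("list_object_versions")
    batch: List[Dict[str, str]] = []
    for page in paginator.paginate(Bucket=bucket):
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        for entry in entries:
            batch.append({"Key": entry["Key"], "VersionId": entry["VersionId"]})
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def purge_versioned_bucket(s3_client: Any, bucket: str) -> Tuple[int, List[dict]]:
    """Delete every version and delete marker, then the bucket if nothing failed."""
    requested = 0
    errors: List[dict] = []
    for batch in iter_version_batches(s3_client, bucket):
        response = s3_client.delete_objects(
            Bucket=bucket, Delete={"Objects": batch, "Quiet": True}
        )
        requested += len(batch)
        # In quiet mode the response lists only the keys that failed.
        errors.extend(response.get("Errors", []))
    if not errors:
        s3_client.delete_bucket(Bucket=bucket)
    return requested, errors
```

Buckets with Object Lock or a restrictive bucket policy would still need separate handling. This sketch only reports those failures through the returned error list.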
6. Deletion Protection Handling
Before cleanup, your script should:
- Disable termination protection on EC2 instances
- Disable deletion protection on RDS instances and load balancers, and termination protection on CloudFormation stacks
- Empty versioned S3 buckets, including all object versions and delete markers, before deleting them
- Check for and handle resources with retention policies
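The protection flags above map to a handful of API calls. The sketch below wraps them in small helpers that take already-created boto3 clients for EC2, RDS, ELBv2, and CloudFormation. The helper names are hypothetical, and error handling is left out for brevity.

```python
from typing import Any


def disable_ec2_termination_protection(ec2_client: Any, instance_id: str) -> None:
    """Clear the DisableApiTermination attribute so the instance can be terminated."""
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": False},
    )


def disable_rds_deletion_protection(rds_client: Any, db_instance_id: str) -> None:
    """Turn off deletion protection on an RDS instance right away."""
    rds_client.modify_db_instance(
        DBInstanceIdentifier=db_instance_id,
        DeletionProtection=False,
        ApplyImmediately=True,
    )


def disable_elb_deletion_protection(elbv2_client: Any, load_balancer_arn: str) -> None:
    """Turn off deletion protection on an Application or Network Load Balancer."""
    elbv2_client.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=[{"Key": "deletion_protection.enabled", "Value": "false"}],
    )


def disable_stack_termination_protection(cfn_client: Any, stack_name: str) -> None:
    """Turn off termination protection on a CloudFormation stack."""
    cfn_client.update_termination_protection(
        EnableTerminationProtection=False,
        StackName=stack_name,
    )
```

Run these only against resources that have already passed your Control Tower filter, since clearing the flag on a protected baseline resource defeats its purpose.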
7. Performance Optimization
For 70 accounts:
- Implement parallel processing with thread pools or async operations
- Use pagination for all AWS API calls to handle large result sets
- Implement exponential backoff for API throttling
- Consider regional parallelization - scan all regions simultaneously per account
- Cache IAM role credentials to reduce STS API calls
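A thread pool over every (account, region) pair is one simple way to get this parallelism, since the work is dominated by waiting on API calls. In the sketch below, `scan_fn` is a hypothetical callable you would supply (for example, one that assumes the role and lists resources), and `max_workers=16` is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product
from typing import Any, Callable, Dict, Iterable, Tuple

Key = Tuple[str, str]  # (account_id, region)


def scan_all(
    accounts: Iterable[str],
    regions: Iterable[str],
    scan_fn: Callable[[str, str], Any],
    max_workers: int = 16,
) -> Tuple[Dict[Key, Any], Dict[Key, Exception]]:
    """Run scan_fn for every account/region pair and collect results and failures."""
    results: Dict[Key, Any] = {}
    failures: Dict[Key, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(scan_fn, account, region): (account, region)
            for account, region in product(accounts, regions)
        }
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:  # keep going; one bad region should not stop the run
                failures[key] = exc
    return results, failures
```

Keeping failures separate from results means a single throttled or disabled region does not hide what was found in the other accounts.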
8. API Throttling at Scale
- Implement retry logic with exponential backoff and jitter
- Use AWS SDK built-in retry mechanisms
- Consider spreading cleanup operations over time if not time-critical
- Monitor CloudWatch metrics for API throttling events
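For calls that are not covered by the SDK's own retries, a small wrapper with exponential backoff and full jitter might look like the sketch below. The throttle error codes listed are common ones but not an exhaustive set, and the helper names are hypothetical.

```python
import random
import time
from typing import Any, Callable

# Common throttling error codes; treat this set as an assumption to extend.
THROTTLE_CODES = {
    "Throttling",
    "ThrottlingException",
    "RequestLimitExceeded",
    "TooManyRequestsException",
}


def is_throttle(exc: Exception) -> bool:
    """True if the exception looks like a botocore ClientError for throttling."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def full_jitter_delay(
    attempt: int, base: float = 0.5, cap: float = 20.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)]."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(
    fn: Callable[[], Any],
    max_attempts: int = 6,
    base: float = 0.5,
    cap: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
    is_retryable: Callable[[Exception], bool] = is_throttle,
) -> Any:
    """Call fn, retrying throttling errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise non-throttling errors and the final failed attempt.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            sleep(full_jitter_delay(attempt, base, cap))
```

For the SDK's built-in mechanism, boto3 clients accept a `botocore.config.Config(retries={"max_attempts": 10, "mode": "adaptive"})` object. The value 10 is only an example.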
9. Audit Trail Requirements
- Enable CloudTrail in each student account with logs sent to a central S3 bucket in your management account
- Tag all cleanup operations with identifiers (student ID, session ID, cleanup timestamp)
- Generate detailed cleanup reports including: resources found, resources deleted, resources that failed deletion, and reasons for failures
- Retain CloudTrail logs for at least 90 days for compliance and troubleshooting
- Consider using CloudWatch Logs Insights for querying cleanup operations
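The cleanup report described above can be a plain JSON document built from one record per resource. The sketch below is one possible shape; the field names and status values are assumptions chosen for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class CleanupRecord:
    account_id: str
    region: str
    resource_type: str
    resource_id: str
    status: str                    # assumed values: "deleted", "failed", "skipped"
    reason: Optional[str] = None   # filled in for failures


def build_report(records: List[CleanupRecord], session_id: str, timestamp: str) -> Dict:
    """Summarize one cleanup run: totals plus details of every failure."""
    counts = Counter(record.status for record in records)
    return {
        "session_id": session_id,
        "cleanup_timestamp": timestamp,
        "resources_found": len(records),
        "resources_deleted": counts["deleted"],
        "resources_failed": counts["failed"],
        "failures": [asdict(r) for r in records if r.status == "failed"],
    }


if __name__ == "__main__":
    demo = [
        CleanupRecord("111122223333", "us-east-1", "ec2_instance", "i-0abc", "deleted"),
        CleanupRecord("111122223333", "us-east-1", "vpc", "vpc-0def", "failed",
                      "DependencyViolation"),
    ]
    print(json.dumps(build_report(demo, "lab-42", "2024-01-01T00:00:00Z"), indent=2))
```

Writing one such file per account and session to the central bucket gives you something to query alongside the CloudTrail logs.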
10. Verification of Clean State
After cleanup, run a verification scan to ensure:
- No resources remain except AWS defaults and Control Tower managed resources
- All regions are checked (students may create resources in non-primary regions)
- Billing alerts are configured to catch any missed resources
- Use the AWS Cost Explorer API to confirm that account costs drop to near zero; allow roughly 24 hours for cost data to refresh
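The verification scan boils down to filtering the post-cleanup inventory against an allowlist. In the sketch below, the inventory is assumed to be a list of dicts with `name` and `is_default` keys produced by your own scanner, and the name prefixes for Control Tower resources are assumptions you should check against what actually exists in your accounts.

```python
from typing import Dict, List

# Assumed naming prefixes for Control Tower managed resources; verify before use.
PROTECTED_NAME_PREFIXES = (
    "aws-controltower-",
    "AWSControlTower",
    "StackSet-AWSControlTower",
)


def is_expected(resource: Dict) -> bool:
    """True for AWS defaults and resources that look Control Tower managed."""
    if resource.get("is_default"):
        return True
    return resource.get("name", "").startswith(PROTECTED_NAME_PREFIXES)


def find_leftovers(inventory: List[Dict]) -> List[Dict]:
    """Return every resource that should not exist after a clean run."""
    return [resource for resource in inventory if not is_expected(resource)]
```

An empty result from `find_leftovers` across all regions is the pass condition. Anything else goes into the cleanup report as a failure.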
11. Failed Deletion Handling
- Implement retry logic for resources stuck in "deleting" state
- Set timeouts for deletion operations (e.g., 30 minutes per resource type)
- Create a "quarantine" list of accounts with failed cleanups for manual review
- Send notifications when cleanup fails or times out
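A polling loop with a deadline covers both the timeout and the quarantine list. In this sketch, `exists_fn` and `cleanup_fn` are hypothetical callables supplied by your own code, and the 30-minute default mirrors the example timeout above.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def wait_until_deleted(
    exists_fn: Callable[[], bool],
    timeout_s: float = 1800.0,
    poll_s: float = 15.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until exists_fn() is False; return False if the deadline passes first."""
    deadline = clock() + timeout_s
    while exists_fn():
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True


def cleanup_with_quarantine(
    accounts: Iterable[str],
    cleanup_fn: Callable[[str], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Run cleanup_fn per account; failed accounts land in the quarantine map."""
    clean: List[str] = []
    quarantine: Dict[str, str] = {}
    for account in accounts:
        try:
            cleanup_fn(account)
            clean.append(account)
        except Exception as exc:  # record the reason for manual review
            quarantine[account] = f"{type(exc).__name__}: {exc}"
    return clean, quarantine
```

The quarantine map is a natural payload for the failure notification, for example an SNS message listing account IDs and reasons.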
Alternative Architecture Consideration
Instead of relying only on detecting and deleting resources, consider using AWS Organizations Service Control Policies (SCPs) to block the services, regions, and actions your labs don't need, combined with AWS Service Catalog to provide pre-approved infrastructure patterns that can be terminated as a unit. This shifts part of the work from reactive cleanup to proactive control.
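To make the SCP idea concrete, here is a minimal sketch that builds a policy document denying a list of actions outright and denying everything outside approved regions. The region list, the denied actions, and the exempted global services are all illustrative assumptions; a real policy would also need to exempt the Control Tower and automation roles.

```python
import json
from typing import Dict, Iterable


def build_lab_guardrail_scp(
    allowed_regions: Iterable[str],
    denied_actions: Iterable[str],
) -> Dict:
    """Build an SCP document as a dict (illustrative, not a complete guardrail)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnneededServices",
                "Effect": "Deny",
                "Action": list(denied_actions),
                "Resource": "*",
            },
            {
                "Sid": "DenyOtherRegions",
                "Effect": "Deny",
                # Global services are exempted so they keep working (assumed list).
                "NotAction": ["iam:*", "sts:*", "organizations:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": list(allowed_regions)}
                },
            },
        ],
    }


if __name__ == "__main__":
    policy = build_lab_guardrail_scp(["us-east-1"], ["redshift:*", "sagemaker:*"])
    print(json.dumps(policy, indent=2))
```

The resulting JSON string could be passed as `Content` to the Organizations `create_policy` call with `Type="SERVICE_CONTROL_POLICY"` and then attached to the student OU. Restricting students to one region also shrinks the cleanup scan considerably.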
For your specific use case with temporary lab environments, the most robust solution would be:
- Use AWS Organizations to create fresh accounts on-demand
- Assign them to students for the lab duration
- Close the accounts after the session ends (closed accounts stay suspended for 90 days before permanent deletion)
- This stops new charges and gives complete isolation, but check the Organizations quotas on account creation and closure before relying on it for 70 accounts
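The closure step itself is a single Organizations API call per account, made from the management account. The sketch below assumes a boto3 Organizations client is passed in; the function name is hypothetical, and failed closures (for instance, when a closure quota is reached) are collected rather than raised.

```python
from typing import Any, Dict, Iterable, List, Tuple


def close_lab_accounts(
    org_client: Any, account_ids: Iterable[str]
) -> Tuple[List[str], Dict[str, str]]:
    """Request closure of each account; return accepted IDs and per-account errors."""
    accepted: List[str] = []
    failed: Dict[str, str] = {}
    for account_id in account_ids:
        try:
            org_client.close_account(AccountId=account_id)
            accepted.append(account_id)
        except Exception as exc:  # keep going so one failure does not block the rest
            failed[account_id] = f"{type(exc).__name__}: {exc}"
    return accepted, failed
```

Accounts in the `failed` map can be retried later or fall back to the script-based cleanup.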
However, if account creation/closure overhead is too high, your Python + Terraform approach with the enhancements above will work, though it requires ongoing maintenance as AWS releases new services.
I personally use AWS Nuke. Keep in mind, though, that it can only delete resources whose types have already been implemented in the project, so brand-new AWS services and features are not always supported immediately. -> https://github.com/gruntwork-io/cloud-nuke
