
Rapid and scalable data recovery using Amazon S3 Versioning, with the Rollback Tool for Amazon S3 (an AWS open source sample)

6 minute read
Content level: Intermediate

Introducing the AWS Solutions Library Guidance for rolling back changes to datasets in Amazon S3

Organizations storing large datasets in Amazon S3 face a critical challenge: how to quickly recover from accidental deletions, overwrites, or unwanted changes that affect millions or billions of objects. Traditional restoration methods can take days or weeks for large-scale recovery operations, which may not meet the stringent Recovery Time Objectives (RTO) required for business-critical data. By undoing only the changes, rather than restoring the entire dataset, the Rollback Tool for Amazon S3 can reduce recovery time and cost by orders of magnitude.

Amazon S3 is an object storage service with industry-leading scalability, data availability, security, performance, and 99.999999999% (11 9s) of data durability. S3 Versioning protects against accidental deletions and overwrites by keeping multiple variants of an object in the same S3 bucket, and by placing a delete marker as the current version in response to simple DELETE requests (that is, requests that do not specify a version ID). In this post, I'll summarize the benefits and capabilities of the tool, and you'll also learn why pre-deployment testing and approval are essential for rapid incident response. Additional detail is available in the tool's readme, and a demo video is available here.

Enterprise-Scale Recovery

The S3 Rollback Tool provides automated recovery for S3 datasets at any scale. The solution handles buckets ranging from thousands to billions of objects and addresses multiple types of undesired changes.

Business Benefits:

  • Minimize cost and business impact during data incidents
  • Accelerate and de-stress critical recovery operations
  • Improve compliance with RTO requirements

Core Capabilities

  • Efficient operations: Performs only minimum necessary operations to revert changes
  • Safe recovery: Never permanently deletes existing data - only manipulates delete markers and copies object versions
  • Precise targeting: Revert a bucket or prefix with one-second precision
  • Multiple modes: Bucket rollback, delete marker removal, and copy-to-bucket scenarios are covered by the same tool
  • Multi-bucket orchestration: Roll back hundreds of buckets from a single CSV input

Performance Metrics

  • 10 million changes detected and reverted in under 1 hour, in a 1 billion-object bucket
  • 100 million changes reverted in under 5 hours
  • And for smaller buckets, thousands of changes in under 15 minutes end-to-end, including real-time inventory creation
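
To relate these figures to an RTO, it can help to express them as a sustained rate. The short calculation below is only arithmetic on the published numbers; because each figure is an upper bound on elapsed time ("under N hours"), the result is a lower bound on throughput for those particular runs, not a guarantee for any other workload.

```python
# Published figures from the list above: (changes reverted, upper bound on hours).
BENCHMARKS = {
    "10 million changes in a 1 billion-object bucket": (10_000_000, 1.0),
    "100 million changes": (100_000_000, 5.0),
}


def min_changes_per_second(changes: int, max_hours: float) -> float:
    """Lower bound on sustained throughput implied by an 'under N hours' figure."""
    return changes / (max_hours * 3600)


for label, (changes, hours) in BENCHMARKS.items():
    rate = min_changes_per_second(changes, hours)
    print(f"{label}: at least {rate:,.0f} changes per second")
```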

Recovery Scenarios

The tool addresses three primary scenarios:

  1. Bucket Rollback Mode: Reverts non-overwrite PUTs, overwrite PUTs, and DELETE operations to return the bucket to its state at a specific point in time
  2. Delete Marker Removal Mode: Removes delete markers placed after a specified time, effectively "undeleting" objects
  3. Copy to Bucket Mode: Recreates dataset state at a specific point in time in a different bucket
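
To make the Bucket Rollback scenario concrete, here is a minimal sketch of how the three kinds of change could be mapped to non-destructive actions for a single key. It is an illustration under stated assumptions, not the tool's actual implementation: the `Version` record, the `plan_rollback` function, and the action names are hypothetical, and the sketch assumes each version has a distinct timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Version:
    """Hypothetical record for one object version or delete marker."""
    version_id: str
    last_modified: datetime
    is_delete_marker: bool = False


def plan_rollback(key: str, versions: list[Version],
                  target: datetime) -> list[tuple[str, str, Optional[str]]]:
    """Return (action, key, version_id) tuples that restore `key` to its state at `target`."""
    ordered = sorted(versions, key=lambda v: v.last_modified)
    newer = [v for v in ordered if v.last_modified > target]
    if not newer:
        return []  # nothing changed after the target time
    older = [v for v in ordered if v.last_modified <= target]
    desired = older[-1] if older else None
    current = ordered[-1]

    if desired is None or desired.is_delete_marker:
        # The key did not exist at the target time: hide it with a delete marker
        # unless it is already hidden. No data is permanently deleted.
        return [] if current.is_delete_marker else [("add_delete_marker", key, None)]
    if all(v.is_delete_marker for v in newer):
        # Only deletes happened since the target: removing the markers "undeletes".
        return [("remove_delete_marker", key, v.version_id) for v in newer]
    # The object was overwritten: copy the desired version so it becomes current.
    return [("copy_version", key, desired.version_id)]


def at(hour: int) -> datetime:
    return datetime(2024, 1, 1, hour, tzinfo=timezone.utc)


target = at(12)
print(plan_rollback("deleted.txt", [Version("v1", at(9)), Version("dm1", at(13), True)], target))
print(plan_rollback("overwritten.txt", [Version("v1", at(9)), Version("v2", at(14))], target))
print(plan_rollback("created.txt", [Version("v1", at(15))], target))
```

Each branch either adds or removes a delete marker or copies an existing version, which mirrors the "safe recovery" property described under Core Capabilities.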

Multi-Bucket Orchestration

For organizations that need to recover many buckets simultaneously, the orchestrator template (s3-rollback-orchestrator.yaml) deploys the rollback tool across hundreds of buckets from a single CSV input:

  • Step Functions Distributed Map launches one child stack per bucket (or per prefix), up to 50 in parallel
  • Shared IAM role keeps you within account quotas at scale
  • Comma-separated prefixes in a single CSV row each produce their own child stack
  • SNS failure notifications alert you if the orchestrator itself terminates abnormally
  • Automatic cleanup deletes all child stacks when the orchestrator stack is removed

Provide a CSV of bucket names and a rollback timestamp, and the orchestrator handles deployment, polling, results consolidation, and teardown. See orchestrator.md for the full walkthrough.
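
The sketch below illustrates how one CSV row with several prefixes could fan out into one work item per child stack. The column layout used here (bucket name first, then an optional quoted, comma-separated prefix list) is an assumption made for illustration only; orchestrator.md defines the real input format. The `expand_rows` function is hypothetical.

```python
import csv
import io


def expand_rows(csv_text: str) -> list[tuple[str, str]]:
    """Expand CSV rows into (bucket, prefix) work items, one per child stack.

    Assumed layout (hypothetical): column 1 is the bucket name, and optional
    column 2 is a quoted, comma-separated list of prefixes. An empty prefix
    stands for the whole bucket.
    """
    items: list[tuple[str, str]] = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or not row[0].strip():
            continue  # skip blank lines
        bucket = row[0].strip()
        raw_prefixes = row[1] if len(row) > 1 else ""
        prefixes = [p.strip() for p in raw_prefixes.split(",") if p.strip()] or [""]
        items.extend((bucket, prefix) for prefix in prefixes)
    return items


SAMPLE = 'logs-bucket\ndata-bucket,"raw/,curated/"\n'
for bucket, prefix in expand_rows(SAMPLE):
    print(bucket, prefix or "(entire bucket)")
```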

Technical Architecture

The tool operates through a workflow that leverages multiple AWS services:

  1. Analysis Phase: Uses Amazon Athena to analyze bucket inventory and identify changes requiring reversal
  2. Planning Phase: Generates S3 Batch Operations jobs for efficient scale processing
  3. Review Phase: Provides opportunity to review proposed changes before execution
  4. Execution Phase: Performs recovery operations and logs results
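
As an illustration of the Planning Phase, S3 Batch Operations can take a CSV manifest in which each row names a bucket, a URL-encoded object key, and optionally a version ID. The sketch below builds such a manifest in memory. It is not necessarily how the tool generates its own manifests; `build_manifest` and the sample keys and version IDs are hypothetical.

```python
import csv
import io
from urllib.parse import quote


def build_manifest(bucket: str, entries: list[tuple[str, str]]) -> str:
    """Build a CSV manifest with rows of bucket, URL-encoded key, version ID."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for key, version_id in entries:
        # Object keys are URL-encoded in the manifest; "/" is left as-is.
        writer.writerow([bucket, quote(key), version_id])
    return buffer.getvalue()


manifest = build_manifest(
    "example-bucket",
    [("reports/2024 q1.csv", "v1"), ("a+b.txt", "v2")],
)
print(manifest, end="")
```

Listing the exact version IDs in the manifest is what allows the Review Phase to show precisely which versions a job would touch before anything is executed.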

Prerequisites

Pre-Incident Preparation Accelerates Recovery

Practice deployment before incidents occur to minimize RTO. When data emergencies happen, approval processes and tool explanations consume valuable recovery time.

Preparation Steps:

  1. Enable Prerequisites: Ensure S3 Versioning is active on critical buckets. To minimize the risk of permanent data loss, deny DeleteObjectVersion to roles that do not need it (noting that this tool will need it) and tightly control S3 Lifecycle changes. Consider applying S3 Object Lock in compliance mode to your most critical data. Enable S3 Metadata on buckets with more than a million objects.
  2. Test Deployment: Deploy the CloudFormation template and run recovery scenarios with representative datasets
  3. Seek Organizational Approval: If you will need security and/or compliance team approval for production use, get it in advance
  4. Documentation: Create recovery procedures and train incident response teams
  5. Validation: Test with datasets that mirror production characteristics
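
One way to express the DeleteObjectVersion guardrail from step 1 is a bucket policy that denies the action to every principal except an approved list. The sketch below only builds the policy document; the function name, bucket name, and role ARN are hypothetical, and any real policy should be reviewed by your security team before use.

```python
import json


def deny_delete_object_version_policy(bucket: str, allowed_role_arns: list[str]) -> dict:
    """Build a bucket policy that denies s3:DeleteObjectVersion to all but approved roles."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDeleteObjectVersionExceptApprovedRoles",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:DeleteObjectVersion",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # The deny applies when the caller's ARN matches none of the listed ARNs.
                "Condition": {"ArnNotLike": {"aws:PrincipalArn": allowed_role_arns}},
            }
        ],
    }


policy = deny_delete_object_version_policy(
    "example-critical-bucket",
    ["arn:aws:iam::111122223333:role/example-rollback-role"],  # hypothetical role
)
print(json.dumps(policy, indent=2))
```

Keeping the rollback role on the approved list matters because removing a delete marker is itself a DeleteObjectVersion call.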

Preparation can transform recovery operations from days to hours, reducing business impact and improving incident response effectiveness.

Considerations

Bear in mind that many disaster and cyber recovery scenarios require an independent copy of the full dataset, perhaps in another AWS account or even another AWS Region. The ability to rapidly recover in place does not remove the need to maintain independent backups of your most critical datasets.

Getting Started

The S3 Rollback Tool is an open-source solution that you deploy through AWS CloudFormation on an as-needed basis. The CloudFormation stack can be deleted once recovery is complete, and organizations can customize the solution for their specific requirements. Whether managing gigabytes or exabytes of data, the solution scales to meet your recovery needs. For multi-bucket recovery, the orchestrator template handles the coordination: provide a CSV and a timestamp, and it deploys and monitors child stacks across all your buckets in parallel.

Effective disaster recovery requires tested, approved plans ready for immediate deployment. By preparing ahead of incidents, organizations can transform their data resilience and confidently meet aggressive RTO targets.

For more information, and to download the tool, visit https://github.com/aws-solutions-library-samples/sample-s3-rollback-tool. We welcome your feedback below, and in the GitHub repository.