Snapshots (and volumes created from them) are missing a month of changes


We have an EBS volume mounted on an EC2 instance. The volume gets daily snapshots created via CloudWatch Events, so we now have ~2 years of daily snapshots for several volumes. However, when I recently went back to do a layered restore by creating a volume from the snapshots, I found that the created volume contains no changes after Nov 8th. The data changes many times every day, but the snapshots do not seem to have captured this. It's as if every snapshot from Nov 8th onward captured no changes, right up until I started looking into the issue last week, which involved reboots, patches, and various manual recovery efforts.

Why would the snapshots contain no changes from Nov 8th up to roughly December 10th? They are all there and every one has an OK status, but when I create a volume from any of those dailies, a log that changes every day still ends on Nov 8th.

My next step will be to mount these snapshot-created volumes elsewhere and see whether the problem has something to do with how they are being attached.

Another behavior we haven't been able to explain: users reported losing a month of data, and the live volume (the one the snapshots are taken from) showed the same pattern of no changes since Nov 8th. The Apache logs show activity during that period, but the data does not. The Apache logs are stored on a separate volume from the data. This seems related, but it would mean our data volume has somehow been corrupted. During this time we used our system and shared content, so we know the content made it to the system and was pulled by other users during this odd time window.

Any ideas? I'm starting to think we've lost data from Nov 8 to about Dec 10 with no log, alert, or other indication that anything was wrong. At least for now, things appear to be working properly going forward.

asked 3 years ago, 202 views
2 Answers

After more research, it looks as if on Dec 10 AWS reverted to a snapshot from Nov 8 and carried on from there. The app logs on the live volume show entries up to Nov 8 and then jump immediately to Dec 10, as if the days in between never existed. Another app using the same volume had the same issue. I'll have to go back and check, but I suspect all the snapshots from Nov 8 until Dec 10 are duplicates, at which point history resumed as if Nov 8 were the day before Dec 10.

Has anybody seen this before? I trawled through CloudTrail to see whether we had somehow restored the main volume from a Nov 8 snapshot, but nothing in the CloudTrail events shows any odd behavior around either Nov 8 or Dec 10.

answered 3 years ago

After a lot of digging around, here's what I found:

A snapshot that had previously been converted to a volume and attached to the instance for recovery carried the same filesystem UUID as the live volume. To make a long story short, after some updates and a reboot, the system apparently picked the first UUID match for the entry in /etc/fstab. It turned out to have mounted the recovery volume, which had been left attached, instead of the proper live volume. Some time later, on a different update/reboot cycle, the system was restarted with the recovery volume properly detached, so the live volume was mounted again and it looked as if a time slice of data was missing.
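
One way to spot this kind of collision is to look for filesystem UUIDs that appear on more than one attached block device. The following is a minimal Python sketch, not what was actually run here. It assumes the input is the text produced by `lsblk -rn -o NAME,UUID`; the function name `find_duplicate_uuids` and the sample device names and UUIDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_uuids(lsblk_output: str) -> Dict[str, List[str]]:
    """Return {uuid: [device names]} for every UUID seen on more than one device.

    Expects one device per line in the form "NAME UUID", as printed by
    `lsblk -rn -o NAME,UUID`. Devices with no filesystem have an empty
    UUID column and are skipped.
    """
    devices_by_uuid = defaultdict(list)
    for line in lsblk_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # no UUID on this line (for example a whole disk)
        name, uuid = fields
        devices_by_uuid[uuid].append(name)
    return {uuid: names for uuid, names in devices_by_uuid.items() if len(names) > 1}


# Hypothetical sample: xvdg is a volume restored from a snapshot of xvdf,
# so both carry the same filesystem UUID.
SAMPLE = """\
xvda
xvda1 11111111-2222-3333-4444-555555555555
xvdf aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
xvdg aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
"""

if __name__ == "__main__":
    for uuid, names in find_duplicate_uuids(SAMPLE).items():
        print(f"UUID {uuid} is shared by: {', '.join(names)}")
```

On a real instance, the input could be captured with `subprocess.run(["lsblk", "-rn", "-o", "NAME,UUID"], capture_output=True, text=True, check=True).stdout` instead of the sample string.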

To prevent this from happening again, /etc/fstab has been updated to use device names instead of UUIDs. Now, if we have a restore volume attached, it won't conflict with the live volume. The chances of the attachment devices being changed in the AWS console are very low, so device names should be nowhere near as much of an issue as UUIDs were.
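
To find which mounts are exposed to this problem, the UUID-based entries in /etc/fstab can be listed. This is a minimal Python sketch under stated assumptions: it takes the fstab contents as text, and the function name `fstab_uuid_entries`, the mount points, and the UUID in the sample are all hypothetical.

```python
from typing import List, Tuple


def fstab_uuid_entries(fstab_text: str) -> List[Tuple[str, str]]:
    """Return (uuid, mount_point) for every fstab entry identified by UUID=."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed entry, ignore it
        spec, mount_point = fields[0], fields[1]
        if spec.startswith("UUID="):
            # The value may be quoted, e.g. UUID="..."
            uuid = spec[len("UUID="):].strip('"')
            entries.append((uuid, mount_point))
    return entries


# Hypothetical fstab: the data volume is mounted by UUID,
# the log volume by device name.
SAMPLE_FSTAB = """\
# <spec> <mount point> <type> <options> <dump> <pass>
UUID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee /data ext4 defaults,nofail 0 2
/dev/xvdh /var/log/httpd ext4 defaults,nofail 0 2
"""

if __name__ == "__main__":
    for uuid, mount_point in fstab_uuid_entries(SAMPLE_FSTAB):
        print(f"{mount_point} is mounted by UUID {uuid}")
```

On a real system the text would come from reading `/etc/fstab`. Any entry the function returns is one that could resolve to a different device if a second volume with the same UUID is attached.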

AWS didn't point to this cause in the support ticket, but some of their feedback on the logs and other information led me to figure it out. Thanks to the AWS support crew for their effort.

answered 3 years ago
