EFS directory size calculation and lifecycle timer checking


Brief Overview We are using EFS (Standard) which gets mounted to our ECS instances multiple times a day. Our lifecycle management policy is set to transition data to IA after 1 day without access. A while back, our EFS storage usage hovered around 200-400 GB, of which only about 5% was in the Standard class and the rest in IA. But for the past 3 months, due to increased storage demand from our jobs, our usage has grown to about 5-7 TB. There are 2 main issues we are facing.

Nature of Data Let me explain the nature of data as it will help in understanding the problem.

  • Large Files (30%): Around 30% (my rough guess) of our data consists of large files, ranging from 100 MB up to hundreds of GB. These files may or may not be evenly distributed among the directories or sub-directories of a given directory.
  • Small Files (70%): Most of our data consists of small files ranging from a few bytes to a few megabytes. These files are used for caching and follow a simple structure for fast lookup: the cache for a record "ABC" is stored under a directory derived from the record's hash. So if the hash of "ABC" is "0f1122" in hex, it is stored as Cache_Directory/0f/0f1122. Since these files are small, we have millions of them distributed over millions of directories. We also have different sets of caches, so there are multiple cache directories for different purposes.
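The lookup scheme above is simple enough to sketch (function and parameter names are my own, for illustration only):

```python
import os

def cache_path(cache_root: str, record_hash: str) -> str:
    """Map a record's hex hash to its cache file path.

    The first two hex characters of the hash pick the subdirectory,
    so hash "0f1122" under "Cache_Directory" maps to
    Cache_Directory/0f/0f1122.
    """
    return os.path.join(cache_root, record_hash[:2], record_hash)
```

With this layout, looking up a record never requires listing a directory; the path is computed directly from the hash, which is why the files are so small and so numerous.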

Life Cycle Management Policy We are using a 1-day timer for moving data to IA, and first access for moving data back out of IA.
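In API terms, our policy looks roughly like this (a sketch using boto3's put_lifecycle_configuration; the client and file system ID are placeholders, not our real values):

```python
# The two lifecycle policies described above: 1 day to IA,
# first access back to Standard.
policies = [
    {"TransitionToIA": "AFTER_1_DAY"},
    {"TransitionToPrimaryStorageClass": "AFTER_1_ACCESS"},
]

def apply_lifecycle(efs_client, file_system_id):
    """Apply the lifecycle policies to a file system.

    efs_client would be boto3.client("efs"); file_system_id is a
    placeholder such as "fs-0123456789abcdef0".
    """
    return efs_client.put_lifecycle_configuration(
        FileSystemId=file_system_id,
        LifecyclePolicies=policies,
    )
```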

Problems

  1. We have a rough idea of which directories in our EFS take up a large portion of the space, but we have no way of checking how much each directory consumes. I have tried mounting the EFS to an EC2 instance and running commands like du, but it takes ages to show anything even for a directory that is not that big. Is there an AWS service that can give me insight into how much space each directory is using?
  2. For the past 2-3 months, around 60% of our storage has been sitting in the Standard class (non-IA), which is costing us a lot. We know that our workflows are not accessing that much of the data on a daily basis, yet the data is not being moved to IA. Is there a way for me to check which directories are in the Standard class and which are in IA, so I can better optimize our storage and decrease the cost? It would also be good to know which data is being regularly accessed so we can make adjustments accordingly. Right now, we are blindfolded.
  3. If I access a file's metadata, does that count as an access that moves the file from IA back to Standard? For example: a directory containing 100 GB of data resides in IA. If I check the directory's size using custom code or Linux commands, will the data be moved to the Standard class?
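For reference, the kind of custom scan I have tried is sketched below (paths are illustrative; it only calls stat, never reads file contents, and it is essentially what du does, so it is just as slow over millions of small files):

```python
import os
import time

def scan_tree(root):
    """Return (total_bytes, newest_atime) for a directory tree.

    Uses only stat metadata (st_size, st_atime); file contents are
    never read.
    """
    total, newest_atime = 0, 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # file removed mid-scan
            total += st.st_size
            newest_atime = max(newest_atime, st.st_atime)
    return total, newest_atime

def report(mount_point):
    """Print size and idle time per top-level directory of a mount."""
    for entry in sorted(os.scandir(mount_point), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            size, atime = scan_tree(entry.path)
            idle_days = (time.time() - atime) / 86400 if atime else float("inf")
            print(f"{entry.name}: {size / 2**30:.1f} GiB, idle {idle_days:.1f} days")
```

Even this metadata-only walk takes hours on our tree, which is why I am asking whether (a) an AWS-side report exists and (b) the stat calls themselves reset the IA timer.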
Ubaid
asked 8 months ago · 150 views
No Answers
