
Amazon S3 Metadata: Full visibility and querying for all your S3 bucket objects

7 minute read
Content level: Advanced

Amazon S3 metadata is additional key-value information stored with your objects. It enables better organization, automation, and management without modifying the actual file content.

Amazon S3 metadata provides a powerful way to store additional information about your objects alongside the actual data, enabling enhanced organization, automation, and management of your cloud storage. This metadata consists of system-defined properties (like object size and last modified date) that S3 automatically manages, as well as user-defined key-value pairs that you can customize to store application-specific information such as content types, processing status, or business tags. By leveraging S3 metadata effectively, you can build more intelligent data workflows, implement automated processing pipelines, and maintain better visibility into your stored objects without needing to access the actual file contents.

Amazon S3 Metadata has now expanded to provide complete, queryable visibility into the metadata for all existing objects in your Amazon S3 buckets, not just new or updated objects. This removes the need for expensive, custom-built scanning systems and simplifies governance, auditing, cost optimization, and large-scale data analytics.

1. The challenge the S3 Metadata feature solves

For customers with billions of objects, understanding their S3 storage footprint was historically a challenge. Existing methods required:

  • Custom systems: Building and maintaining complex, resource-intensive systems to scan objects, track changes, and manage metadata over time.

  • API costs: Relying on object-level APIs (such as ListObjects or HeadObject) at scale may not be the most cost-efficient approach.

  • Inventory reports: Waiting for daily Amazon S3 Inventory reports may not fit use cases that need up-to-date metadata for near real-time operations.

The new S3 Metadata feature solves this by providing fully managed, queryable tables that maintain a complete, up-to-date snapshot of your bucket's metadata.
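To make the API cost point concrete, here is a back-of-the-envelope sketch in Python. It relies on one known limit (a ListObjectsV2 call returns at most 1,000 keys); the function name and the example object count are illustrative assumptions, and no prices are included because they vary by Region and over time.

```python
LIST_PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per request


def scan_request_counts(object_count: int) -> dict:
    """Estimate the requests needed to scan a bucket's metadata once.

    Hypothetical helper for illustration: one LIST page per 1,000 keys,
    plus one HeadObject call per object to read its metadata.
    """
    # Integer ceiling division avoids floating-point rounding on huge counts.
    list_requests = (object_count + LIST_PAGE_SIZE - 1) // LIST_PAGE_SIZE
    head_requests = object_count
    return {
        "list_requests": list_requests,
        "head_requests": head_requests,
        "total_requests": list_requests + head_requests,
    }


# Example: a bucket with 2.5 billion objects (an assumed figure).
print(scan_request_counts(2_500_000_000))
```

Multiplying the request counts by your Region's current request prices gives a rough cost for a single full scan, which a custom system would have to repeat to stay current.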

2. Key components: The two managed table types

When enabled, AWS automatically creates and maintains two types of Apache Iceberg tables that you can query using standard SQL tools like Amazon Athena.

| Table Type | Function | Update Frequency | Primary Use Cases |
| --- | --- | --- | --- |
| Live Inventory Table | Provides a complete and current snapshot of all objects and their metadata, including existing, backfilled objects. | Refreshed automatically within one hour of changes (uploads, deletions, updates). | Cost analysis, compliance checks, data discovery, general inventory. |
| Journal Table | Provides a near real-time log of object-level changes over time. | Near real-time view of uploads, deletions, and metadata modifications. | Auditing activity, tracking object lifecycle, monitoring security events. |

3. Benefits of S3 metadata

This feature significantly improves how you manage and interact with large S3 datasets:

  • Governance & compliance: Instantly query to identify objects that violate policies (e.g., finding unencrypted objects or objects missing required tags).

  • Cost optimization: Analyze the distribution of object tags, storage classes, and sizes across your entire bucket to quickly identify and act on cost-saving opportunities.

  • Faster analytics: Avoid waiting for metadata discovery before processing can begin.

  • Zero management overhead: AWS fully manages the underlying tables, including backfilling existing object data, compaction, and garbage collection.

4. How to enable S3 Metadata (step-by-step)

You can enable S3 Metadata for any general purpose S3 bucket using the Amazon S3 console:

  1. Select your bucket: Navigate to the Amazon S3 console and choose the target bucket.

  2. Go to Metadata tab: Select the Metadata tab for your bucket.

  3. Create configuration: Choose Create metadata configuration.

  4. Configure tables:

     • Journal Table: Enabled automatically. Optionally configure server-side encryption and set a record expiration period (e.g., 365 days).

     • Live Inventory Table: Choose Enabled, then configure the desired server-side encryption options.

  5. Create configuration: Choose Create metadata configuration.

The system immediately starts the backfill process to populate the Live Inventory Table with metadata for all existing objects. The time taken depends on the quantity of objects in your bucket.

Once S3 Metadata is enabled, you also need to enable table bucket integration with AWS analytics services. The console displays this message:

   To analyze metadata tables in this Region with AWS query engines, you must first enable table bucket integration with AWS analytics services.

  6. Enable integration: Choose Enable integration.
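The console steps above can also be scripted. The following is a minimal sketch, assuming the boto3 S3 client exposes `create_bucket_metadata_configuration` with the `MetadataConfiguration` fields shown; treat those names as assumptions and check them against the current boto3 documentation before relying on them. The helper names and the bucket name are hypothetical.

```python
def build_metadata_configuration(record_expiration_days=365, enable_inventory=True):
    """Build a request payload mirroring the console choices above.

    Field names are assumptions based on the CreateBucketMetadataConfiguration
    API; verify them against the boto3 reference for your SDK version.
    """
    return {
        "JournalTableConfiguration": {
            # Journal records expire after the given number of days.
            "RecordExpiration": {"Expiration": "ENABLED", "Days": record_expiration_days},
        },
        "InventoryTableConfiguration": {
            "ConfigurationState": "ENABLED" if enable_inventory else "DISABLED",
        },
    }


def enable_s3_metadata(s3_client, bucket, **options):
    """Send the configuration using an already-created S3 client."""
    return s3_client.create_bucket_metadata_configuration(
        Bucket=bucket,
        MetadataConfiguration=build_metadata_configuration(**options),
    )


# Usage (requires boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   enable_s3_metadata(boto3.client("s3"), "amzn-s3-demo-bucket", record_expiration_days=365)
```

Passing the client in as an argument keeps the payload logic separate from AWS access, so the configuration can be reviewed or unit-tested without touching a real bucket.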

5. Real-life query examples with Amazon Athena

Once the tables are created, you can query your data directly using Amazon Athena. The tables are accessible in the AWS Glue Data Catalog.

  1. Go to the Amazon Athena console.

  2. In the left pane, select s3tablescatalog/aws-s3 as the Catalog and b_<bucket-name> as the Database.

(Note: In the queries below, replace "your_bucket_inventory_table" and "your_bucket_journal_table" with the actual table names shown in your Athena console.)
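If you prefer to run these queries from code rather than the console, the sketch below submits a query through the Athena API and waits for the result. It uses the standard boto3 Athena calls (`start_query_execution`, `get_query_execution`, `get_query_results`); passing the catalog name from step 2 in `QueryExecutionContext` is an assumption, and `run_athena_query` and the output location are hypothetical names for illustration.

```python
import time


def run_athena_query(athena_client, sql, database, output_location,
                     catalog="s3tablescatalog/aws-s3", poll_seconds=1.0):
    """Submit a query, wait for it to finish, and return rows as dicts.

    Only the first page of results is read; larger result sets would
    need pagination with NextToken.
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Catalog": catalog, "Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena_client.get_query_execution(
            QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} ended as {status['State']}: {reason}")

    rows = athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # For SELECT statements the first row holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   rows = run_athena_query(boto3.client("athena"), 'SELECT key FROM "your_bucket_inventory_table" LIMIT 5;',
#                           "b_<bucket-name>", "s3://your-athena-results-bucket/")
```

Athena returns every value as text, so numeric columns such as size need to be converted by the caller.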

1. Security and encryption compliance (Live Inventory Table)

| Use Case | SQL Query Example |
| --- | --- |
| Find unencrypted objects | SELECT key, encryption_status FROM "your_bucket_inventory_table" WHERE encryption_status = 'NONE'; |
| Identify objects missing required tags | SELECT key FROM "your_bucket_inventory_table" WHERE element_at(object_tags, 'Project') IS NULL; |
| Objects encrypted with a specific KMS key | SELECT key, kms_key_arn FROM "your_bucket_inventory_table" WHERE kms_key_arn = 'arn:aws:kms:us-east-1:123456789012:key/your-specific-key-id'; |
| Count of objects using SSE-S3 vs. SSE-KMS | SELECT encryption_status, COUNT(*) AS object_count FROM "your_bucket_inventory_table" WHERE encryption_status != 'NONE' GROUP BY 1; |
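Compliance policies often require several tags at once. As a small sketch, the hypothetical helper below generates the missing-tag query for any list of required tag keys. It uses `element_at`, which returns NULL when a map key is absent; the function name and the second tag key are assumptions for illustration.

```python
def missing_tags_query(table, required_tags):
    """Return SQL listing objects that lack at least one required tag."""
    if not required_tags:
        raise ValueError("required_tags must contain at least one tag key")

    def quote(value):
        # Double any single quotes so the tag key is a valid SQL string literal.
        return value.replace("'", "''")

    conditions = " OR ".join(
        f"element_at(object_tags, '{quote(tag)}') IS NULL" for tag in required_tags
    )
    return f'SELECT key FROM "{table}" WHERE {conditions};'


print(missing_tags_query("your_bucket_inventory_table", ["Project", "Owner"]))
```

The generated statement can be pasted into the Athena query editor or submitted through the Athena API.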

2. Storage efficiency and lifecycle planning (Live Inventory Table)

| Use Case | SQL Query Example |
| --- | --- |
| Total size and count of objects by storage class | SELECT storage_class, COUNT(*) AS object_count, SUM(size) / POWER(1024, 3) AS total_size_gb FROM "your_bucket_inventory_table" GROUP BY storage_class; |
| Find the top 10 largest objects in the bucket | SELECT key, size / POWER(1024, 2) AS size_in_mb, last_modified_date FROM "your_bucket_inventory_table" ORDER BY size DESC LIMIT 10; |
| Count of objects uploaded per day (for the last week) | SELECT DATE(last_modified_date) AS upload_day, COUNT(*) AS object_count FROM "your_bucket_inventory_table" WHERE last_modified_date >= (current_date - interval '7' day) GROUP BY 1 ORDER BY 1 DESC; |
| Analyze storage class usage by tag | SELECT storage_class, element_at(object_tags, 'Department') AS department, SUM(size) / POWER(1024, 3) AS usage_in_gb FROM "your_bucket_inventory_table" GROUP BY storage_class, element_at(object_tags, 'Department'); |
| Find objects not modified in over a year | SELECT key, last_modified_date FROM "your_bucket_inventory_table" WHERE last_modified_date < (current_date - interval '365' day); |
| Find objects with specific user metadata | SELECT key, element_at(user_metadata, 'source-system') AS source_system FROM "your_bucket_inventory_table" WHERE element_at(user_metadata, 'source-system') IS NOT NULL; |
| Find large objects not using Intelligent-Tiering | SELECT key, size, storage_class FROM "your_bucket_inventory_table" WHERE size > 1073741824 /* 1 GB */ AND storage_class != 'INTELLIGENT_TIERING'; |
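Once the storage class totals are back from Athena, a little post-processing makes them easier to act on. The sketch below assumes the rows have already been fetched as dictionaries with `storage_class` and `total_size_gb` keys (matching the first query in this table) and that the values arrive as strings, as Athena returns them. The function name and sample figures are hypothetical.

```python
def storage_class_shares(rows):
    """Return each storage class's share of total size, as a percentage."""
    sizes = {row["storage_class"]: float(row["total_size_gb"]) for row in rows}
    total = sum(sizes.values())
    if total == 0:
        # Avoid dividing by zero for empty buckets.
        return {storage_class: 0.0 for storage_class in sizes}
    return {
        storage_class: round(100 * size / total, 1)
        for storage_class, size in sizes.items()
    }


# Sample rows with made-up sizes, shaped like the Athena result.
sample_rows = [
    {"storage_class": "STANDARD", "total_size_gb": "75"},
    {"storage_class": "GLACIER", "total_size_gb": "25"},
]
print(storage_class_shares(sample_rows))
```

A large share in a frequently accessed class for data that is rarely modified is one signal worth checking against your lifecycle rules.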

3. Auditing and Change Tracking (Journal Table)

| Use Case | SQL Query Example |
| --- | --- |
| Track recently deleted objects | SELECT key, record_timestamp, requester FROM "your_bucket_journal_table" WHERE record_timestamp >= (current_date - interval '7' day) AND record_type = 'DELETE'; |
| Identify source IPs for recent CREATE requests | SELECT source_ip_address, COUNT(source_ip_address) AS request_count FROM "your_bucket_journal_table" WHERE record_type = 'CREATE' GROUP BY source_ip_address ORDER BY 2 DESC; |
| Monitor updates to object tags | SELECT key, last_modified_date, version_id FROM "your_bucket_journal_table" WHERE record_type = 'UPDATE_METADATA'; |
| Trace all operations (CREATE, UPDATE_METADATA, DELETE) for a specific object key | SELECT record_timestamp, record_type, requester, version_id FROM "your_bucket_journal_table" WHERE key = 'path/to/my/critical/file.json' ORDER BY record_timestamp ASC; |
| List the most active AWS principals (users/roles) in the last 24 hours | SELECT requester, COUNT(*) AS operation_count FROM "your_bucket_journal_table" WHERE record_timestamp >= (NOW() - interval '24' hour) GROUP BY 1 ORDER BY 2 DESC LIMIT 10; |
| Identify S3 Lifecycle actions that deleted objects in the last month | SELECT key, version_id, record_timestamp FROM "your_bucket_journal_table" WHERE requester = 's3.amazonaws.com' AND record_type = 'DELETE' AND DATE(record_timestamp) >= (current_date - interval '30' day); |
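Audit queries are usually run for many different keys and time windows, so it can help to generate them. The sketch below builds the per-object history query from this table for an arbitrary key, escaping single quotes so unusual key names do not break the statement. The function name and the optional `days` filter are assumptions added for illustration.

```python
def object_history_query(table, key, days=None):
    """Build the journal query that traces every operation on one object key."""
    # Double single quotes so the key is a valid SQL string literal.
    escaped_key = key.replace("'", "''")
    sql = (
        "SELECT record_timestamp, record_type, requester, version_id "
        f'FROM "{table}" WHERE key = \'{escaped_key}\''
    )
    if days is not None:
        # Optional window: only records from the last `days` days.
        sql += f" AND record_timestamp >= (current_timestamp - interval '{int(days)}' day)"
    return sql + " ORDER BY record_timestamp ASC;"


print(object_history_query("your_bucket_journal_table", "path/to/my/critical/file.json", days=30))
```

For keys that come from untrusted input, Athena's parameterized queries are a safer alternative to building the string by hand.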

This article covered how the expanded Amazon S3 Metadata feature simplifies large-scale data management. It removes the need for complex, custom metadata systems by providing two fully managed, queryable Apache Iceberg tables: the Live Inventory Table and the Journal Table. You also saw the step-by-step process for enabling the feature in the Amazon S3 console and how to start extracting value with real-life SQL queries in Amazon Athena.

AWS
EXPERT
Published a month ago