You could use AWS Lambda to extract the metadata and store it in Amazon DynamoDB, then use AWS Glue to create a catalog and query it with Athena. For example, you can retrieve an object's metadata using s3.head_object in boto3:
import boto3
# Create an S3 client
s3 = boto3.client('s3')
response = s3.head_object(Bucket='your-bucket-name', Key='path/to/your/object')
# Print the metadata
print(response['Metadata'])
# You can also access other metadata attributes, such as:
print("Size:", response['ContentLength'])
print("Last Modified:", response['LastModified'])
print("ETag:", response['ETag'])
You can also use S3 Inventory. It produces CSV, ORC, or Parquet output files that list a bucket's objects and their metadata on a daily or weekly basis. The inventory can include details like the object key, version ID, size, and last modified date, and you can analyze the inventory files with Amazon Athena, AWS Glue, or other analytics tools.
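As a minimal sketch, an inventory can be enabled programmatically with boto3's put_bucket_inventory_configuration. The bucket names, account ID, inventory ID, and field list below are placeholder assumptions, not values from this thread:

```python
def inventory_config(dest_bucket, dest_account_id, inventory_id="daily-metadata"):
    """Build an S3 Inventory configuration dict that emits a daily
    Parquet listing with size, last-modified date, ETag, and storage
    class for every current object version."""
    return {
        "Destination": {
            "S3BucketDestination": {
                "AccountId": dest_account_id,          # placeholder account ID
                "Bucket": f"arn:aws:s3:::{dest_bucket}",
                "Format": "Parquet",
            }
        },
        "IsEnabled": True,
        "Id": inventory_id,
        "IncludedObjectVersions": "Current",
        "OptionalFields": ["Size", "LastModifiedDate", "ETag", "StorageClass"],
        "Schedule": {"Frequency": "Daily"},
    }

# To actually apply it (requires boto3, credentials, and s3:PutInventoryConfiguration):
# s3 = boto3.client("s3")
# s3.put_bucket_inventory_configuration(
#     Bucket="your-bucket-name",
#     Id="daily-metadata",
#     InventoryConfiguration=inventory_config("your-inventory-bucket", "123456789012"),
# )
```

The live API call is left commented out so the configuration builder can be inspected and tested offline.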
AWS Glue can crawl S3 data to generate metadata such as file types and schemas, which is stored in the AWS Glue Data Catalog. You can query the catalog to get metadata about S3 objects, such as listing files by type, and the catalog can also back Athena tables for SQL queries.
So in summary, Glue crawlers populate metadata in the Glue Data Catalog, which enables discovery and analysis of S3 data via metadata queries, Athena, and more.
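A crawler like the one described can be created with boto3's glue.create_crawler. This is a sketch only; the crawler name, IAM role ARN, database name, S3 path, and schedule below are all hypothetical:

```python
def crawler_params(name, role_arn, database, s3_path):
    """Parameters for glue.create_crawler: crawl one S3 path into the
    Glue Data Catalog on a daily schedule."""
    return {
        "Name": name,
        "Role": role_arn,                      # IAM role with Glue + S3 access
        "DatabaseName": database,              # catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 2 * * ? *)",       # daily at 02:00 UTC
    }

# To actually create and start it (requires boto3 and credentials):
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params(
#     "s3-data-crawler",
#     "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#     "my_catalog_db",
#     "s3://your-bucket-name/data/"))
# glue.start_crawler(Name="s3-data-crawler")
```

Keeping the parameters in a small builder function makes it easy to stamp out one crawler per prefix if the bucket is large.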
I think we are talking about two different things. I understand that Glue can crawl the data in S3 and create metadata based on that data, but this does not cover the S3 object metadata that was set by the customer or added by default by the S3 SDK. I was asking whether Glue can crawl the S3 metadata itself (not the data).
AWS Glue Crawler infers the schema of the data stored in Amazon S3, not the metadata associated with the S3 objects. To analyze your S3 object metadata, first use the HeadObject API through the AWS SDK or AWS CLI to extract the metadata of your objects, then store the retrieved metadata back in S3 and query the result directly with analytics services such as Athena.
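The extract-then-store step above could flatten each HeadObject response into one newline-delimited JSON row, a format Athena can query from S3 directly. The field names and the sample response below are illustrative assumptions, not part of this thread:

```python
import json

def metadata_record(key, response):
    """Flatten a HeadObject-style response into one JSON-serializable
    row: object key, size, last-modified timestamp, ETag, and any
    user-defined metadata. One such row per line yields NDJSON that
    Athena can query from S3."""
    return {
        "key": key,
        "size": response["ContentLength"],
        "last_modified": str(response["LastModified"]),
        "etag": response["ETag"],
        "user_metadata": response.get("Metadata", {}),
    }

# Example with a dict shaped like boto3's head_object output (hypothetical values):
resp = {
    "ContentLength": 1024,
    "LastModified": "2024-05-01 12:00:00",
    "ETag": '"abc123"',
    "Metadata": {"owner": "team-a"},
}
line = json.dumps(metadata_record("path/to/object", resp))
```

In a batch pipeline you would append one such line per object to a file and upload the file to an S3 prefix that an Athena table points at.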
Thanks for the response. Lambda is definitely an option, but I think it would be really expensive to invoke it per event (we're talking about moving a couple of billion objects per day). S3 Inventory looks like a better candidate here, though. Does Glue support crawling metadata the way it does data? That would make it easier to build a pipeline using only Glue with some sort of batch processing, without adding another dependency on a datastore like DynamoDB.