partitioning s3 access logs to optimize athena queries
Hi,
When enabling s3 access logs on a bucket, there is this great feature that the generated logs can be queried through Athena. One thing I can however not figure out, is how to partition the external table so I can limit the data scanned e.g. by date.
The generated file structure is like
- s3://my-access-logs/2022-02-15-11-11-18-6E829E27FAAA289A
- s3://my-access-logs/2022-02-15-11-11-19-2870C85C7E0D2B20
- ..
- s3://my-access-logs/2022-02-16-11-01-01-6E829E27FAAA289A
- ...
I added a "PARTITIONED BY (dt string)" to my table creation DDL, and then added partition like
ALTER TABLE my_logs_table ADD PARTITION (dt = '2022-02-15') location 's3://my-access-logs/2022-02-15'
However, when querying the table no data is scanned, and no rows are returned (even if I don't specify a filter on dt in my query).
Most examples I could find always deal with s3 structures where there is a hierarchical folder structure, like 2022/02/15/xxx
Anybody has any solution for this?
Thanks, Bert
You need to create partitions, this can be achieved by writing a simple AWS Lambda code.
Event based serverless architecture: S3 Access Log Bucket (PUT event notification) --> SQS --> Lambda in batch mode of 10 (Parses file name and perform S3 copy while adding partition prefix)
For example: Input: s3://my-access-logs/2022-02-15-11-11-18-6E829E27FAAA289A Output:
- Option-1 s3://my-access-logs/pt_date=2022-02-15/2022-02-15-11-11-18-6E829E27FAAA289A
- Option-2 s3://my-access-logs/year=2022/month=02/day=15/hour=11/2022-02-15-11-11-18-6E829E27FAAA289A
Additionally, you can have aggressive s3 "Lifecycle Policy" to delete the original logs in 1 day to save on storage costs.
Once this is done, you can register Athena table with HIVE partitions and query it seamlessly.
Alternatively, there is this blog which utilized Glue to perform this in a batch mode - https://aws.amazon.com/blogs/big-data/easily-query-aws-service-logs-using-amazon-athena/
Relevant questions
partitioning s3 access logs to optimize athena queries
asked 3 months agohow to monitor each request and response in cloudfront?
asked 2 days agoNew s3 bucket, filling with files before anything was connected
Accepted Answerasked 4 months agoStoring Application Load Balancer access logs in a KMS-encrypted S3 bucket
asked 3 years agoRDS/Postgres logs to S3
Accepted Answerasked 3 years agoAWS ELB log analyzer
Accepted Answerasked 4 months agoS3 server access logs to Cloudwatch?
Accepted Answerasked 4 years agos3 no sign request didn't work
asked 4 months agoGrant Access to Control Tower created Cloudtrail S3 Bucket
asked 4 months agounable to access S3 log files owned by "s3-log-service"
asked 2 years ago
Athena needs partitions to be by folder. It cannot partition individual files by their name.