You are right to note that this architecture requires additional processing components, namely a second S3 bucket and an ETL job to transform the data, which adds some complexity. Below I outline the potential benefits and the use case where this is applicable, and also a method to query the access logs directly.
This solution is meant for those who set up access logs without enabling 'Date-based partitioning'. There is a way to query non-partitioned access logs by creating an Athena table and pointing it at them, but adding this ETL process helps users add partitions to reduce costs (as Athena charges by the amount of data scanned) and increase security (by separating raw data from transformed / analytical data). [1][2][3][4]
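For illustration, here is a minimal sketch of such a table over non-partitioned logs, adapted from the pattern in [4]. The database, table, and bucket names are placeholders, and the schema is abbreviated to the first eight log fields; see [4] for the complete column list and regex.

```sql
-- Minimal Athena table over non-partitioned S3 access logs.
-- Schema abbreviated to the first eight fields; the remaining fields
-- are discarded by the trailing .*$ in the regex (see [4] for the
-- full version). Database, table, and bucket names are placeholders.
CREATE EXTERNAL TABLE s3_access_logs_db.mybucket_logs (
  bucketowner STRING,
  bucket_name STRING,
  requestdatetime STRING,
  remoteip STRING,
  requester STRING,
  requestid STRING,
  operation STRING,
  key STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) .*$'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://amzn-s3-demo-logging-bucket/prefix/';
```

Because this table has no partitions, every query scans the full log prefix, which is exactly the cost problem the ETL-and-partition approach addresses.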
By separating your raw data from your transformed / analytical data (typically separate buckets, though the same final bucket in this case), you gain several advantages: you can restrict access to the raw data while allowing access to the analytical data, and because the raw data is preserved in an isolated S3 bucket, you can regenerate a curated analytical dataset from the original raw data. This is useful if you transform the data in your ETL process but later decide to apply a different transformation to all historical data. [1]
To your point, you can create an Athena table and query the data with the benefits of partitions if you select 'Date-based partitioning' when creating your access logs. However, you would still not get the benefits mentioned above of separating raw data from transformed / analytics-ready data, which an ETL process and a final analytics destination provide (the same bucket in the linked article's case). Because the generated partitions are not in Hive format, as they would be if written from Glue, and Athena can only load Hive-style partitions automatically, you would need to use partition projection to limit your queries. [5][6]
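As a rough sketch, a partition-projection table over date-partitioned logs could look like the following. The names, date range, and prefix layout are placeholders (date-based partitioning delivers logs under a `.../yyyy/mm/dd/`-style prefix), and the schema is abbreviated as above; see [6] for the supported projection properties.

```sql
-- Sketch: partition projection over date-partitioned access logs.
-- Names, the date range, and storage.location.template are
-- placeholders; adjust them to match your actual delivery prefix.
CREATE EXTERNAL TABLE s3_access_logs_db.mybucket_logs_partitioned (
  bucketowner STRING,
  bucket_name STRING,
  requestdatetime STRING,
  remoteip STRING,
  requester STRING,
  requestid STRING,
  operation STRING,
  key STRING
)
PARTITIONED BY (`timestamp` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) .*$'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://amzn-s3-demo-logging-bucket/prefix/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.timestamp.type' = 'date',
  'projection.timestamp.format' = 'yyyy/MM/dd',
  'projection.timestamp.range' = '2024/01/01,NOW',
  'projection.timestamp.interval' = '1',
  'projection.timestamp.interval.unit' = 'DAYS',
  'storage.location.template' = 's3://amzn-s3-demo-logging-bucket/prefix/${timestamp}/'
);
```

A query can then restrict the scan with a predicate such as `WHERE "timestamp" >= '2024/06/01'`, so Athena reads only the matching date prefixes instead of the whole log location.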
References:
- [1] https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/data-lake-foundation.html
- [2] https://calculator.aws/#/createCalculator/Athena
- [3] https://docs.aws.amazon.com/athena/latest/ug/create-table.html
- [4] https://repost.aws/knowledge-center/analyze-logs-athena
- [5] https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerLogs.html#server-access-logging-overview
- [6] https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
To analyze S3 access logs without needing a separate ETL job or a secondary bucket, you can use AWS Glue with Amazon Athena directly on the logs in the original S3 bucket.
With a schema-on-read approach, an AWS Glue crawler catalogs the logs' schema in the Glue Data Catalog, allowing queries through Athena without duplicating the data. This avoids the extra storage and processing steps while still giving you catalog-based querying.
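As a quick illustration, once a crawler has cataloged the logs you can query them in place. The database and table names below are hypothetical placeholders that would come from your crawler configuration.

```sql
-- Example query against a Glue-crawled access-log table.
-- Database and table names are placeholders from the crawler config.
SELECT requester,
       operation,
       COUNT(*) AS request_count
FROM "s3_access_logs_db"."crawled_access_logs"
GROUP BY requester, operation
ORDER BY request_count DESC
LIMIT 20;
```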