I opened an issue about this problem as well: https://github.com/awslabs/athena-glue-service-logs/issues/32
We have been attempting to partition our S3 access logs to make them easier to query and found the above tool to be a viable solution. However, we encountered a problem: certain S3 actions share the same request ID, which causes the partitioned Parquet logs to contain less data than the input logs. Entries appear to be deduplicated or overwritten.
Input log entries:
9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY BATCH.DELETE.OBJECT f1683055189142x766494105173435800/IMG_1138.jpeg - 204 - - - - - - - - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -
9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY REST.POST.MULTI_OBJECT_DELETE - "POST /BUCKET_NAME/?delete HTTP/1.1" 200 - 305 - 29 - "-" "-" - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -
The Athena query for the request ID returns the following:
9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME 2023-05-02 19:19:55.000 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY REST.POST.MULTI_OBJECT_DELETE POST /BUCKET_NAME/?delete HTTP/1.1 200 305 29 Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 - - 2023 05 02
The BATCH.DELETE.OBJECT entry, which identifies the object being deleted, is lost in the output.
Due to the volume of S3 access logs, triggering a Lambda on each object creation is not a great option for us, and we do not need the conversion to happen immediately. Has anyone worked around this problem, or found a way for a Glue job to output properly partitioned logs without losing data?
I am going to be working on a Python script solution but I am curious if anyone has done this already.
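As a starting point, here is a minimal sketch of what such a script could look like: group raw access log lines into year/month/day partition keys parsed from the timestamp field, keeping every line so that nothing is deduplicated away. This is a hypothetical illustration, not the tool's actual code; `partition_lines` and the partition-key layout are assumptions.

```python
import re
from collections import defaultdict

# Matches the "[02/May/2023:" prefix of the S3 access log timestamp field.
TS_RE = re.compile(r"\[(\d{2})/(\w{3})/(\d{4}):")
MONTHS = {m: f"{i:02d}" for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

def partition_lines(lines):
    """Group raw S3 access log lines by a Hive-style partition key.

    Every input line is kept; two entries sharing a request ID both
    land in the same partition instead of one replacing the other.
    """
    partitions = defaultdict(list)
    for line in lines:
        m = TS_RE.search(line)
        if not m:
            continue  # skip unparseable lines rather than guessing
        day, mon, year = m.groups()
        key = f"year={year}/month={MONTHS[mon]}/day={day}"
        partitions[key].append(line)
    return dict(partitions)
```

Writing each partition's lines out as Parquet (e.g. via pyarrow) would then be a separate step; the point of the sketch is only that nothing keys on the request ID.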
The concern here is the contents of the output file. All the heavy lifting is done by the tool: it takes a source bucket of S3 access logs, processes the files into partitioned prefixes, and outputs Parquet files. The input S3 access log has two entries for a delete, but the output file has only one. The loss of the other entry is problematic, since that entry carries the more relevant detail.
You cannot troubleshoot a lost row by querying the output (since it is lost); you would need to trace at which point in the pipeline it is dropped (possibly deduplicated), for instance by printing the number of records matching that delete event from the moment the data is read until it is written.
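The tracing idea above can be sketched with a small instrumentation helper dropped between pipeline stages, plus a hypothetical reproduction of the suspected bug (a dedup keyed on request ID). The stage names and the dict-based dedup are assumptions for illustration, not the Glue job's actual logic.

```python
def trace_stage(name, records, request_id):
    """Materialize a stage's output and report how many records still
    carry the given request ID. Insert between read/transform/write steps."""
    records = list(records)
    hits = sum(1 for r in records if request_id in str(r))
    print(f"{name}: {hits} record(s) with request ID {request_id}")
    return records

# Hypothetical repro: a dedup keyed on request ID keeps only one of
# the two delete entries, matching the observed data loss.
raw = [
    "1C25EZNCB2HBMQQY BATCH.DELETE.OBJECT IMG_1138.jpeg",
    "1C25EZNCB2HBMQQY REST.POST.MULTI_OBJECT_DELETE -",
]
after_read = trace_stage("after read", raw, "1C25EZNCB2HBMQQY")
deduped = trace_stage(
    "after dedup on request ID",
    {line.split()[0]: line for line in after_read}.values(),
    "1C25EZNCB2HBMQQY",
)
```

If the count drops from 2 to 1 at a particular stage in the real job, that stage is where the BATCH.DELETE.OBJECT entry disappears.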
We are in agreement. My original question was to determine whether others had already encountered this problem and found an answer. I have been attempting to debug the code and have run into difficulties, but I am continuing to troubleshoot.