Resolution
Amazon S3 stores server access logs as objects in an S3 bucket.
Prerequisite:
For the logs stored in your Amazon S3 bucket, define date-based partitioning for your S3 structure. In the following resolution example, the table is partitioned and the partition column is timestamp. For your S3 location, include a date-based folder structure (/YYYY/MM/DD) that maps to the timestamp column.
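With date-based partitioning turned on, Amazon S3 delivers log objects under keys of the following shape (the bucket and prefix names are placeholders):

```
s3://amzn-s3-demo-logging-bucket/prefix/SourceAccountId/SourceRegion/SourceBucket/YYYY/MM/DD/YYYY-MM-DD-hh-mm-ss-UniqueString
```

The /YYYY/MM/DD segment of this key is what the timestamp partition column maps to in the table definition later in this article.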
To use Athena to analyze Amazon S3 server access logs, complete the following steps:
1. Turn on server access logging for your S3 bucket, if you haven't already. Note the values for Target bucket and Target prefix. You need both values to specify the Amazon S3 location in an Athena query.
Note: For log object key format, choose date-based partitioning to speed up analytics and query applications.
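If you turn on logging programmatically instead of in the console, the PutBucketLogging API accepts a logging configuration of the following shape. The bucket name and prefix here are placeholders; PartitionedPrefix selects the date-based log object key format:

```json
{
  "LoggingEnabled": {
    "TargetBucket": "amzn-s3-demo-logging-bucket",
    "TargetPrefix": "logs/",
    "TargetObjectKeyFormat": {
      "PartitionedPrefix": { "PartitionDateSource": "EventTime" }
    }
  }
}
```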
2. Open the Amazon Athena console.
Note: Before you run your first query, set up a query result location in Amazon S3.
3. In the Query editor, run a DDL statement to create a database.
Note: It's a best practice to create the database in the same AWS Region as your S3 bucket.
CREATE DATABASE s3_access_logs_db
4. Create a table schema in the database. In the following example, the STRING and BIGINT data type values are the access log properties. You can query these properties in Athena. For LOCATION, enter the S3 bucket and prefix path from Step 1. Make sure to include a forward slash (/) at the end of the prefix (for example, s3://doc-example-bucket/prefix/). If you don't use a prefix, then include a forward slash (/) at the end of the bucket name, for example, s3://doc-example-bucket/.
CREATE EXTERNAL TABLE s3_access_logs_db.mybucket_logs(
`bucketowner` STRING,
`bucket_name` STRING,
`requestdatetime` STRING,
`remoteip` STRING,
`requester` STRING,
`requestid` STRING,
`operation` STRING,
`key` STRING,
`request_uri` STRING,
`httpstatus` STRING,
`errorcode` STRING,
`bytessent` BIGINT,
`objectsize` BIGINT,
`totaltime` STRING,
`turnaroundtime` STRING,
`referrer` STRING,
`useragent` STRING,
`versionid` STRING,
`hostid` STRING,
`sigv` STRING,
`ciphersuite` STRING,
`authtype` STRING,
`endpoint` STRING,
`tlsversion` STRING,
`accesspointarn` STRING,
`aclrequired` STRING)
PARTITIONED BY (
`timestamp` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-name/prefix-name/account-id/region/source-bucket-name/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.timestamp.format'='yyyy/MM/dd',
'projection.timestamp.interval'='1',
'projection.timestamp.interval.unit'='DAYS',
'projection.timestamp.range'='2024/01/01,NOW',
'projection.timestamp.type'='date',
'storage.location.template'='s3://bucket-name/prefix-name/account-id/region/source-bucket-name/${timestamp}')
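Before you run the CREATE TABLE statement, you can optionally sanity-check the SerDe regex against a sample log record on your own machine. The following is a minimal sketch with Python's re module; the pattern is the same input.regex as above with Hive's doubled backslashes collapsed to single ones, and the sample record uses illustrative identifiers:

```python
import re

# Python version of the SerDe's input.regex. The 26 capture groups map,
# in order, to the 26 table columns, bucketowner through aclrequired.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) '
    r'("[^"]*"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) '
    r'([^ ]*) ("[^"]*"|-) ([^ ]*)'
    r'(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
)

# Sample record in the server access log format (IDs are illustrative).
sample = (
    '79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be '
    'awsexamplebucket1 [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
    '79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be '
    '3E57427F3EXAMPLE REST.GET.VERSIONING - '
    '"GET /awsexamplebucket1?versioning HTTP/1.1" 200 - 113 - 7 - "-" '
    '"S3Console/0.4" - hostidEXAMPLE= SigV2 ECDHE-RSA-AES128-GCM-SHA256 '
    'AuthHeader awsexamplebucket1.s3.us-west-1.amazonaws.com TLSV1.1 - -'
)

match = LOG_PATTERN.match(sample)
fields = match.groups()
print(fields[6])   # operation: REST.GET.VERSIONING
print(fields[9])   # httpstatus: 200
```

If the pattern doesn't match your records, or fields land in the wrong columns, check whether your logs use a different layout before you create the table.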
5. In the left pane, under Tables, choose the ellipsis next to the table name, and then choose Preview table. If you see data from the server access logs in the Results window, then you successfully created the Athena table. The data contains values such as bucketowner, bucket_name, and requestdatetime.
Example queries
To find the request for a deleted object, use the following query:
SELECT *
FROM s3_access_logs_db.mybucket_logs
WHERE key = 'images/picture.jpg'
AND operation like '%DELETE%';
To show Amazon S3 request IDs for requests that resulted in 403 Access Denied errors, use the following query:
SELECT requestdatetime, requester, operation, requestid, hostid
FROM s3_access_logs_db.mybucket_logs
WHERE httpstatus = '403';
To find Amazon S3 request IDs for HTTP 5xx errors in a specific time period with the key and error code, use the following query:
SELECT requestdatetime, key, httpstatus, errorcode, requestid, hostid
FROM s3_access_logs_db.mybucket_logs
WHERE httpstatus like '5%'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30';
To show who deleted an object and when with the timestamp, IP address, and AWS Identity and Access Management (IAM) role, use the following query:
SELECT requestdatetime, remoteip, requester, key
FROM s3_access_logs_db.mybucket_logs
WHERE key = 'images/picture.jpg'
AND operation like '%DELETE%';
To show all operations that an IAM role performs, use the following query:
SELECT *
FROM s3_access_logs_db.mybucket_logs
WHERE requester='arn:aws:iam::123456789123:user/user_name';
To show all operations that were performed on an object in a specific time period, use the following query:
SELECT *
FROM s3_access_logs_db.mybucket_logs
WHERE key='prefix/images/picture.jpg'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30';
To show how much data is transferred to an IP address for a specific time period, use the following query:
SELECT coalesce(SUM(bytessent), 0) AS bytessentTOTAL
FROM s3_access_logs_db.mybucket_logs
WHERE remoteip='1.2.3.4'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30';
To show all expire operations that lifecycle rules performed in a specific time period, use the following query:
SELECT * FROM s3_access_logs_db.mybucket_logs
WHERE operation = 'S3.EXPIRE.OBJECT' AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30';
To count the number of objects that expired in a specific time period, use the following query:
SELECT count(*) as ExpireCount
FROM s3_access_logs_db.mybucket_logs
WHERE operation = 'S3.EXPIRE.OBJECT'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30';
To show all transition operations that lifecycle rules performed in a specific time period, use the following query:
SELECT *
FROM s3_access_logs_db.mybucket_logs
WHERE operation like 'S3.TRANSITION%'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30';
To show all requesters grouped by Signature Version, use the following query:
SELECT requester, sigv, count(sigv) as SigCount
FROM s3_access_logs_db.mybucket_logs
GROUP BY requester, sigv;
To show all anonymous requesters that made requests in a specific time period, use the following query:
SELECT bucket_name, requester, remoteip, key, httpstatus, errorcode, requestdatetime
FROM s3_access_logs_db.mybucket_logs
WHERE requester IS NULL AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30';
To show all requesters that sent PUT object requests in a specific time period, use the following query:
SELECT bucket_name, requester, remoteip, key, httpstatus, errorcode, requestdatetime
FROM s3_access_logs_db.mybucket_logs
WHERE operation='REST.PUT.OBJECT'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30';
To show all requesters that sent GET object requests in a specific time period, use the following query:
SELECT bucket_name, requester, remoteip, key, httpstatus, errorcode, requestdatetime
FROM s3_access_logs_db.mybucket_logs
WHERE operation='REST.GET.OBJECT'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30';
To show all requesters ordered by highest turnaround time in a specific time period, use the following query:
SELECT * FROM s3_access_logs_db.mybucket_logs
WHERE NOT turnaroundtime='-'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'
ORDER BY CAST(turnaroundtime AS INT) DESC;
Note: Because the number of log objects in the bucket grows over time, each query processes more data. It's a best practice to create a lifecycle configuration for your server access logs bucket that periodically removes old log files. This reduces the amount of data that Athena analyzes for each query.
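For example, a lifecycle rule of the following shape expires log objects 90 days after creation. The rule ID, prefix, and retention period here are placeholders that you adjust for your own bucket:

```json
{
  "Rules": [
    {
      "ID": "expire-s3-access-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Expiration": { "Days": 90 }
    }
  ]
}
```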
Related information
Analyzing Amazon S3 server access logs using Amazon OpenSearch Service
Amazon S3 server access log format
Querying AWS service logs
Querying access logs for requests by using Amazon Athena
Considerations and limitations