By using AWS re:Post, you agree to the Terms of Use
/Athena query analysis all data, rather than specified columns/

Athena query analysis all data, rather than specified columns

0

I've created an Athena table to analyse ALB logs stored in s3 using the recommended create table command in the AWS docs. I'm running some queries on this table to retrieve data from specific columns. However, when I look at the data analysed, it appears the data used in the query amount the the total data in the logs files. These log files contain many columns that I am not referencing in my query. I thought if I am only retrieving data from specific columns, that it would only search the data from those columns, and therefore the data used in the query would be less? Is there something I'm missing here?

Query:

SELECT FROM_UNIXTIME(FLOOR(TO_UNIXTIME(parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z'))/60)*60) as time, ssl_protocol, count(ssl_protocol) as total FROM alb_logs WHERE ssl_protocol LIKE '%' AND request_verb = 'POST' AND REGEXP_LIKE(request_url, 'https:\/\/myapp.*\/v3\/me\/order.*') AND elb_status_code = 200 AND parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z') BETWEEN TIMESTAMP '2021-12-06 12:12:44' AND TIMESTAMP '2022-01-05 12:12:44' GROUP BY 1, ssl_protocol ORDER BY 1 limit 20

asked 4 months ago25 views
2 Answers
2
Accepted Answer

Hi,

no you are not missing anything, the behaviour you are seeing is normal, the ALB log files are compressed text files, and the table used in the documentation you mentioned is not partitioned.

This means that to find the rows you need in your query Athena still needs to open and scan all file to make sure there are no rows needed by the query in that file. As for the column the log file stores all columns for a specific row (it is just a text string) in the same file; so to extract the columns you need Athena still need to scan all files.

Now, if you want to optimize the cost of your query you have some options:

  1. modify the Athena table and introduce Partition Projection using a new partition column for the date (based on the documentation the log files are actually stored with a prefix structure that include yyyy/mm/dd . for example:
CREATE EXTERNAL TABLE IF NOT EXISTS alb_log_partition_projection (
type string,
time string,
elb string,
client_ip string,
client_port int,
target_ip string,
target_port int,
request_processing_time double,
target_processing_time double,
response_processing_time double,
elb_status_code string,
target_status_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
request_url string,
request_proto string,
user_agent string,
ssl_cipher string,
ssl_protocol string,
target_group_arn string,
trace_id string,
domain_name string,
chosen_cert_arn string,
matched_rule_priority string,
request_creation_time string,
actions_executed string,
redirect_url string,
lambda_error_reason string,
target_port_list string,
target_status_code_list string,
classification string,
classification_reason string
)
PARTITIONED BY(mydate STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' =
'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
LOCATION 's3://yourS3bucket/AWSLogs/youraccount/elasticloadbalancing/us-east-1/'
TBLPROPERTIES (
'has_encrypted_data'='false',
"projection.enabled" = "true",
"projection.mydate.type" = "date",
"projection.mydate.format" = "yyyy/MM/dd",
"projection.mydate.range" = "2021/01/01,NOW",
"projection.datehour.interval" = "1",
"projection.datehour.interval.unit" = "DAY",
"storage.location.template" = "s3://yourS3bucket/AWSLogs/youraccount/elasticloadbalancing/us-east-1/${mydate}"
)
  1. if you want to further optimize the cost of your queries and also performance , you could use a CTAS query in Athena to create an new table in Apache Parquet format (highly compressed and columnar format) and define the partitions you need (probably just date) . In this case not only your WHERE clause will limit the files to scan based on the time interval you are interested in , but it will only scan the column you are interested in. For Example:
CREATE TABLE alb_parquet3
WITH (
      format = 'Parquet',
      write_compression = 'SNAPPY',
      partitioned_by = ARRAY['mydate'],
      external_location = 's3://yourS3bucket/data-lake/alb-pqt/')
as select * from "alb_log_partition_projection" 

or

CREATE TABLE alb_parquet3
WITH (
      format = 'Parquet',
      write_compression = 'SNAPPY',
      partitioned_by = ARRAY['mydate'],
      external_location = 's3://yourS3bucket/data-lake/alb-pqt/')
as select *, substr(time,1,10) as mydate from "alb_logs" 

with approach 2, you will then need to schedule a batch to insert data in the new table. Having partitioned the table by day you will need to do the insert once a day to avoid duplicate. Otherwise you can make the process to convert the data a bit more complex and you could drop the partition for the current day, delete the prefix for the current date from S3 and finally call the insert with a where clause on data only for the current day. In this case you could even run the incremental load process every few hours or every few minutes.

In this case having the original table with partition projection allows you to scan less data during the incremental process.

you can also check this knowledge base article.

hope this helps

EXPERT
answered 4 months ago
  • Am I right in thinking that if option 2 is chosen, option 1 in no longer needed? Thanks!

  • Hi Thanks for the thorough answer, that's helpful. About option 2, when you say "define the partitions you need (probably just date)" , by date, do you mean the actual column in the original table (in this case, named 'time'), or do you mean the 'date' partition that was created in option 1? If you mean the latter, I'm not sure I understand how this can be utilised when using CREATE TABLE AS?

  • @AWS-User-0601396, If you use the CTAS to create the partitioned Parquet table you are right , because the partition will be in Hive style. (you will need to consider a batch process to insert the new partitions once a day.

    If you have another table in parquet format but with prefixes like yyyy/month/day you could actually combine the 2 methods.

1

The ALB access logs are in fact stored as text files and not in columnar format such as Parquet. As a result, the amount of data scanned is independent from number of columns in the Select statement. To enhance your query performance (and reduce data scanned), an option would be implementing partition projection with Amazon Athena - https://aws.amazon.com/premiumsupport/knowledge-center/athena-analyze-access-logs/

With this approach, you will also need to rewrite the query with the Where clause using the "year", "month" and "day" partitions/columns in addition of "time".

answered 4 months ago
EXPERT
reviewed 4 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions