Querying cloudtrail logs in athena with partitions

0

Hello, I have currently setup cloudtrail logs to be query able in athena, however my create table looks something like

CREATE EXTERNAL TABLE cloudtrail(
    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>,
            ec2RoleDelivery:string,
            webIdFederationData:map<string,string>
        >
    >,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcendpointid STRING,
    tlsDetails struct<
        tlsVersion:string,
        cipherSuite:string,
        clientProvidedHostHeader:string>
  )
PARTITIONED BY (accountid string,region string,year string,month string,day string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-controltower-logs-423167154025-ca-central-1/o-1vy7fz11og/AWSLogs/'
TBLPROPERTIES (
  'projection.accountid.range'='0,9999999999999', 
  'projection.accountid.type'='integer', 
  'projection.enabled'='true',
  'projection.region.type'='enum',
 'projection.region.values'='us-east-2,us-east-1,us-west-1,us-west-2,af-south-1,ap-east-1,ap-south-1,ap-northeast-3,ap-northeast-2,ap-southeast-1,ap-southeast-2,ap-northeast-1,ca-central-1,eu-central-1,eu-west-1,eu-west-2,eu-south-1,eu-west-3,eu-north-1,me-south-1,sa-east-1,eu-south-2',  
  'projection.year.format'='yyyy',
  'projection.year.interval'='1', 
  'projection.year.interval.unit'='YEARS',
  'projection.year.range'='2021,NOW',
  'projection.year.type'='date',
  'projection.month.format'='MM',
  'projection.month.interval'='1', 
  'projection.month.interval.unit'='MONTHS',
  'projection.month.range'='01,12',
  'projection.month.type'='integer',
  'projection.day.format'='dd',
  'projection.day.interval'='1',
  'projection.day.interval.unit'='DAYS',
  'projection.day.range'='01,31', 
  'projection.day.type'='integer',
  'projection.day.digits'='2',
 'storage.location.template'='s3://aws-controltower-logs-423167154025-ca-central-1/o-1vy7fz11og/AWSLogs/${accountid}/CloudTrail/${region}/${year}/${month}/${day}')

While I am able to query the table however it compulsorily needs me to pass the account id in the where clause. no matter I add date and region in the where clause. Is this a caveat? For example If I run the query: select * from cloudtrail_db.cloudtrail where year='2023' and month='11' and day='01' and region='ca-central-1' it gives me an error Amazon Athena experienced an internal error while executing this query. Please try submitting the query again and if the issue reoccurs, contact AWS support for further assistance. We apologize for the inconvenience. But if I run the same query as above along with account id it is able to query.

ramsa
asked 5 months ago279 views
1 Answer
0

Hello,

I could see that currently you have defined projection for “accountid” as below:

'projection.accountid.range'='0,9999999999999', 'projection.accountid.type'='integer',

Please note that in partition projection, Athena calculates partition values and locations using the table properties that you configure directly on your table in AWS Glue. If a projected partition does not exist in Amazon S3, Athena will still project the partition. In your case it is projecting the values from 0 till 9999999999999 for accountid which is causing unnecessary performance degradation and might be causing the issue here. The same gets fixed when you include the “accountid” column in your SELECT query.

I would kindly request you to consider the below suggestions to try and see if it fixes the issue:

  1. Restrict the range - For instance ‘projection.accountid.range’=‘111111111111,999999999999' considering all your accounts lies in the above range. You can further try reducing the range further as per your use-case.

  2. Use “Enum type” - If possible consider specifying all your accounts as members of an enumerated set

AWS
Ankur_J
answered 5 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions