Querying cloudtrail logs in athena with partitions


Hello, I have currently setup cloudtrail logs to be query able in athena, however my create table looks something like

    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcendpointid STRING,
    tlsDetails struct<
PARTITIONED BY (accountid string,region string,year string,month string,day string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

While I am able to query the table however it compulsorily needs me to pass the account id in the where clause. no matter I add date and region in the where clause. Is this a caveat? For example If I run the query: select * from cloudtrail_db.cloudtrail where year='2023' and month='11' and day='01' and region='ca-central-1' it gives me an error Amazon Athena experienced an internal error while executing this query. Please try submitting the query again and if the issue reoccurs, contact AWS support for further assistance. We apologize for the inconvenience. But if I run the same query as above along with account id it is able to query.

posta 4 mesi fa261 visualizzazioni
1 Risposta


I could see that currently you have defined projection for “accountid” as below:

'projection.accountid.range'='0,9999999999999', 'projection.accountid.type'='integer',

Please note that in partition projection, Athena calculates partition values and locations using the table properties that you configure directly on your table in AWS Glue. If a projected partition does not exist in Amazon S3, Athena will still project the partition. In your case it is projecting the values from 0 till 9999999999999 for accountid which is causing unnecessary performance degradation and might be causing the issue here. The same gets fixed when you include the “accountid” column in your SELECT query.

I would kindly request you to consider the below suggestions to try and see if it fixes the issue:

  1. Restrict the range - For instance ‘projection.accountid.range’=‘111111111111,999999999999' considering all your accounts lies in the above range. You can further try reducing the range further as per your use-case.

  2. Use “Enum type” - If possible consider specifying all your accounts as members of an enumerated set

con risposta 4 mesi fa

Accesso non effettuato. Accedi per postare una risposta.

Una buona risposta soddisfa chiaramente la domanda, fornisce un feedback costruttivo e incoraggia la crescita professionale del richiedente.

Linee guida per rispondere alle domande