Querying cloudtrail logs in athena with partitions


Hello, I have currently setup cloudtrail logs to be query able in athena, however my create table looks something like

    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcendpointid STRING,
    tlsDetails struct<
PARTITIONED BY (accountid string,region string,year string,month string,day string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

While I am able to query the table however it compulsorily needs me to pass the account id in the where clause. no matter I add date and region in the where clause. Is this a caveat? For example If I run the query: select * from cloudtrail_db.cloudtrail where year='2023' and month='11' and day='01' and region='ca-central-1' it gives me an error Amazon Athena experienced an internal error while executing this query. Please try submitting the query again and if the issue reoccurs, contact AWS support for further assistance. We apologize for the inconvenience. But if I run the same query as above along with account id it is able to query.

demandé il y a 6 mois322 vues
1 réponse


I could see that currently you have defined projection for “accountid” as below:

'projection.accountid.range'='0,9999999999999', 'projection.accountid.type'='integer',

Please note that in partition projection, Athena calculates partition values and locations using the table properties that you configure directly on your table in AWS Glue. If a projected partition does not exist in Amazon S3, Athena will still project the partition. In your case it is projecting the values from 0 till 9999999999999 for accountid which is causing unnecessary performance degradation and might be causing the issue here. The same gets fixed when you include the “accountid” column in your SELECT query.

I would kindly request you to consider the below suggestions to try and see if it fixes the issue:

  1. Restrict the range - For instance ‘projection.accountid.range’=‘111111111111,999999999999' considering all your accounts lies in the above range. You can further try reducing the range further as per your use-case.

  2. Use “Enum type” - If possible consider specifying all your accounts as members of an enumerated set

répondu il y a 5 mois

Vous n'êtes pas connecté. Se connecter pour publier une réponse.

Une bonne réponse répond clairement à la question, contient des commentaires constructifs et encourage le développement professionnel de la personne qui pose la question.

Instructions pour répondre aux questions