Querying cloudtrail logs in athena with partitions

0

Hello, I have currently setup cloudtrail logs to be query able in athena, however my create table looks something like

CREATE EXTERNAL TABLE cloudtrail(
    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>,
            ec2RoleDelivery:string,
            webIdFederationData:map<string,string>
        >
    >,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcendpointid STRING,
    tlsDetails struct<
        tlsVersion:string,
        cipherSuite:string,
        clientProvidedHostHeader:string>
  )
PARTITIONED BY (accountid string,region string,year string,month string,day string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-controltower-logs-423167154025-ca-central-1/o-1vy7fz11og/AWSLogs/'
TBLPROPERTIES (
  'projection.accountid.range'='0,9999999999999', 
  'projection.accountid.type'='integer', 
  'projection.enabled'='true',
  'projection.region.type'='enum',
 'projection.region.values'='us-east-2,us-east-1,us-west-1,us-west-2,af-south-1,ap-east-1,ap-south-1,ap-northeast-3,ap-northeast-2,ap-southeast-1,ap-southeast-2,ap-northeast-1,ca-central-1,eu-central-1,eu-west-1,eu-west-2,eu-south-1,eu-west-3,eu-north-1,me-south-1,sa-east-1,eu-south-2',  
  'projection.year.format'='yyyy',
  'projection.year.interval'='1', 
  'projection.year.interval.unit'='YEARS',
  'projection.year.range'='2021,NOW',
  'projection.year.type'='date',
  'projection.month.format'='MM',
  'projection.month.interval'='1', 
  'projection.month.interval.unit'='MONTHS',
  'projection.month.range'='01,12',
  'projection.month.type'='integer',
  'projection.day.format'='dd',
  'projection.day.interval'='1',
  'projection.day.interval.unit'='DAYS',
  'projection.day.range'='01,31', 
  'projection.day.type'='integer',
  'projection.day.digits'='2',
 'storage.location.template'='s3://aws-controltower-logs-423167154025-ca-central-1/o-1vy7fz11og/AWSLogs/${accountid}/CloudTrail/${region}/${year}/${month}/${day}')

While I am able to query the table however it compulsorily needs me to pass the account id in the where clause. no matter I add date and region in the where clause. Is this a caveat? For example If I run the query: select * from cloudtrail_db.cloudtrail where year='2023' and month='11' and day='01' and region='ca-central-1' it gives me an error Amazon Athena experienced an internal error while executing this query. Please try submitting the query again and if the issue reoccurs, contact AWS support for further assistance. We apologize for the inconvenience. But if I run the same query as above along with account id it is able to query.

ramsa
질문됨 4달 전272회 조회
1개 답변
0

Hello,

I could see that currently you have defined projection for “accountid” as below:

'projection.accountid.range'='0,9999999999999', 'projection.accountid.type'='integer',

Please note that in partition projection, Athena calculates partition values and locations using the table properties that you configure directly on your table in AWS Glue. If a projected partition does not exist in Amazon S3, Athena will still project the partition. In your case it is projecting the values from 0 till 9999999999999 for accountid which is causing unnecessary performance degradation and might be causing the issue here. The same gets fixed when you include the “accountid” column in your SELECT query.

I would kindly request you to consider the below suggestions to try and see if it fixes the issue:

  1. Restrict the range - For instance ‘projection.accountid.range’=‘111111111111,999999999999' considering all your accounts lies in the above range. You can further try reducing the range further as per your use-case.

  2. Use “Enum type” - If possible consider specifying all your accounts as members of an enumerated set

AWS
Ankur_J
답변함 4달 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인