Partition projection for VPC Flow Logs in Parquet with Hive-compatible prefixes


I am not sure if what I am asking for is possible, as I don't fully understand Athena, Glue, and all the nuances of how partition projection works. I am trying to switch all of our VPC Flow Logs from writing to CloudWatch Logs to writing directly to S3 in Parquet format with Hive-compatible prefixes turned on. I have successfully gotten my Parquet table and schema to work by loading the partitions with MSCK REPAIR TABLE. Initially I was going to call it done, but I'm worried about how long MSCK REPAIR TABLE will take to run to load new partitions, since we are centralizing our VPC Flow Logs from many accounts into one S3 bucket. I also don't know whether there are cost implications as the number of objects in the bucket grows. So my thought was to use partition projection, which has the added benefit of not having to run MSCK REPAIR TABLE on a schedule. However, I haven't seen any examples of partition projection being used with Parquet-formatted objects or Hive-compatible S3 prefixes.

Can someone confirm whether this is possible? Should I not be using Hive-compatible prefixes in this scenario? Here is what my Glue table looks like with partition projection enabled. It comes from a CloudFormation template, so the base bucket name is a parameter, but it should be enough to spot where I am going wrong:

GlueTableVpcFlowParquet:
    Type: AWS::Glue::Table
    Properties:
      CatalogId:
        Ref: AWS::AccountId
      DatabaseName:
        Ref: GlueDatabase
      TableInput:
        Name:
          Fn::Join:
          - ''
          - - Ref: TableNameVPCFlowParquet
            - Ref: CIAppend
        TableType: EXTERNAL_TABLE
        Parameters:
          skip.header.line.count: '1'
          projection.enabled: 'true'
          projection.aws_account_id.type: injected
          projection.year.type: integer
          projection.year.range: 2022,9999
          projection.month.type: integer
          projection.month.range: 01,12
          # projection.month.digits: '2'
          projection.day.type: integer
          projection.day.range: 01,31
          projection.aws_region.type: enum
          projection.aws_region.values:
            Ref: TableRegions
          storage.location.template:
            Fn::Join:
            - ''
            - - Ref: TableLocationVpcFlowParquet
              - "/awslogs/aws-account-id=${aws_account_id}/aws-service=vpcflowlogs/aws-region=${aws_region}/year=${year}/month=${month}/day=${day}"
        PartitionKeys:
        - Name: aws_account_id
          Type: string
        - Name: aws_region
          Type: string
        - Name: year
          Type: int
        - Name: month
          Type: int
        - Name: day
          Type: int
        StorageDescriptor:
          Columns:
          - Name: version
            Type: int
          - Name: account_id
            Type: string
          - Name: interface_id
            Type: string
          - Name: srcaddr
            Type: string
          - Name: dstaddr
            Type: string
          - Name: srcport
            Type: int
          - Name: dstport
            Type: int
          - Name: protocol
            Type: bigint
          - Name: packets
            Type: bigint
          - Name: bytes
            Type: bigint
          - Name: start
            Type: bigint
          - Name: end
            Type: bigint
          - Name: action
            Type: string
          - Name: log_status
            Type: string
          Location:
            Fn::Join:
            - ''
            - - Ref: TableLocationVpcFlowParquet
              - ""
          InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
          SerdeInfo:
            SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
          BucketColumns: []
          SortColumns: []

Object prefixes in s3 look like this:

s3://<bucketName>/AWSLogs/aws-account-id=012345678912/aws-service=vpcflowlogs/aws-region=us-east-1/year=2022/month=11/day=10/012345678912_vpcflowlogs_us-east-1_fl-003e10087957dd87a_20221110T1850Z_33d78e4c.log.parquet

I run an Athena query like this:

SELECT * FROM "iaasdatasources"."vpcflowlogsparquet" WHERE aws_account_id='012345678912' and aws_region='us-east-1' and year=2022 and month=11 and day=14 limit 10

It seems to run for 20 seconds but returns zero results and no errors.

Also, if anyone has advice on using a date-type partition projection instead of separate year/month/day partitions, that would be helpful. I wasn't sure what the format should be, considering the date components in the path are Hive-formatted as year=<year>/month=<month>/day=<day>.

asked a year ago · 932 views
2 Answers

I understand that you want to use partition projection on your table. You can use partition projection on top of Parquet data; there is no documented limitation there. The partition format you are using is also fine: Athena supports Hive-style partitioning, so the way you are defining your partitions is correct.

Coming to the issue you are facing, where the query returns no data: please have a look at the documentation covering the considerations and limitations of partition projection. An Athena query will return no rows if a projected partition does not exist in the S3 bucket, but it seems you have that part right as well.

I see that your S3 prefix contains another partition, 'aws-service', which is not registered in your Glue table as a partition key. Please try adding it and check whether the query returns results.
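For illustration, a minimal sketch of that suggestion in the same CloudFormation style as the question could look like this (the enum projection and the single value vpcflowlogs are assumptions based on the S3 layout shown above, not part of the original answer):

    # Hypothetical addition, not from the original thread: register the
    # aws-service path element as a projected partition key. With only flow
    # logs under this prefix, a single-value enum projection is sufficient.
    Parameters:
      projection.aws_service.type: enum
      projection.aws_service.values: vpcflowlogs
    PartitionKeys:
    - Name: aws_service
      Type: string

If the storage.location.template is kept, it would then reference ${aws_service} instead of hardcoding aws-service=vpcflowlogs in the path.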

Please go through the documentation on the data types supported for partition projection. You can use the date type for this.
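For reference, the date type is configured through a handful of projection.* table properties. A minimal sketch in the same style, where the column name log_date and the range are placeholders rather than anything from the thread, might be:

    # Sketch of a date-type projected partition using the documented
    # projection.* property names; column name and range are placeholders.
    Parameters:
      projection.enabled: 'true'
      projection.log_date.type: date
      projection.log_date.range: 2022/01/01,NOW
      projection.log_date.format: yyyy/MM/dd
      projection.log_date.interval: '1'
      projection.log_date.interval.unit: DAYS
    PartitionKeys:
    - Name: log_date
      Type: string

The range values must be written in the same pattern as the format property, and NOW keeps the upper bound current.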

I hope this information is of use to you! If you require an in-depth investigation of any issue you are facing in Athena, I would suggest raising a ticket with the Athena technical support team.

AWS
SUPPORT ENGINEER
Chaitu
answered a year ago
  • Thanks for your response @chaitu. Related to what you mentioned about not having 'aws-service' as a partition: I added it as a partition key and also got rid of the storage location template, since partition projection should recognize the column values automatically without being told where they are. With those changes I am now getting back data as expected! But I am still unclear on how to format a date partition for this layout. Would this be correct for the date format with Hive-compatible prefixes?

    "projection.day.format" = "year=yyyy/month=MM/day=dd"

  • The projection format can be, for example, projection.columnName.format=yyyyMM, dd-MM-yyyy, or dd-MM-yyyy-HH-mm-ss.

    Note: this is a date format string based on the Java DateTimeFormatter and can use any supported java.time.* type.
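Following up on the format question in the comments: in a java.time.DateTimeFormatter pattern, literal text has to be wrapped in single quotes so that letters such as y, M, and d keep their pattern meaning. An untested sketch of a single date-typed key replacing the separate year/month/day keys, where the key name dt, the range, and the template are all assumptions, might look like:

    # Untested illustration: one date-typed key whose formatted value covers
    # the year=/month=/day= path segments. Literal text is single-quoted so
    # the pattern letters are not interpreted inside words like "year".
    Parameters:
      projection.dt.type: date
      projection.dt.range: "year=2022/month=01/day=01,NOW"
      projection.dt.format: "'year='yyyy'/month='MM'/day='dd"
      projection.dt.interval: '1'
      projection.dt.interval.unit: DAYS
      storage.location.template:
        Fn::Join:
        - ''
        - - Ref: TableLocationVpcFlowParquet
          - "/AWSLogs/aws-account-id=${aws_account_id}/aws-service=vpcflowlogs/aws-region=${aws_region}/${dt}"
    PartitionKeys:
    - Name: dt
      Type: string

Whether Athena accepts slashes inside a projected date value like this is worth verifying; the separate year/month/day integer keys shown earlier are the known-working layout.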


Hello buddy. I ended up in the same situation as you today and found your question here. So I created this public Gist with what is working for me right now.

I'm using hourly partitions in the Flow Logs, but if you don't use them, just remove the occurrences of the "hour" field from "PARTITIONED BY" and "TBLPROPERTIES".
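The Gist itself is not reproduced here, but a rough sketch of what the hourly pieces add, translated into the same CloudFormation style as the question (the property values are assumptions, not the Gist's contents), is:

    # Illustration only, not the linked Gist: an extra hour key projected as
    # a two-digit integer alongside the year/month/day keys.
    Parameters:
      projection.hour.type: integer
      projection.hour.range: 0,23
      projection.hour.digits: '2'
    PartitionKeys:
    - Name: hour
      Type: int

Dropping the hour key, as the answer notes, just means removing these entries.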

Hope this helps you too. =)

answered a year ago
