I am not sure whether what I am asking for is possible, as I don't fully understand Athena and Glue or all the nuances of how partition projection works. I am trying to switch all of our VPC flow logs from writing to CloudWatch Logs to writing directly to S3 in Parquet format with hive-compatible prefixes turned on. I have successfully gotten my Parquet table and schema to work by loading the partitions with MSCK REPAIR TABLE. Initially I was going to call it done, but we are centralizing VPC flow logs from many accounts into one S3 bucket, and I'm worried about how long MSCK REPAIR TABLE will take to load new partitions as the bucket grows. I also don't know whether there are cost implications as the number of objects in the bucket increases. So my thought was to use partition projection, which has the added benefit of not having to run MSCK REPAIR TABLE on a schedule. However, I haven't seen any examples of partition projection being used with Parquet-formatted objects or hive-compatible S3 prefixes.
Can someone confirm whether this is possible? Should I not be using hive-compatible prefixes in this scenario?
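For reference, this is the kind of scheduled partition maintenance I'm hoping projection lets me avoid (the table name is the one from my query further down; the partition values are just illustrative):

MSCK REPAIR TABLE vpcflowlogsparquet;

-- or adding each new partition explicitly, which also has to run on a schedule:
ALTER TABLE vpcflowlogsparquet ADD IF NOT EXISTS
PARTITION (aws_account_id='012345678912', aws_region='us-east-1', year=2022, month=11, day=14)
LOCATION 's3://<bucketName>/AWSLogs/aws-account-id=012345678912/aws-service=vpcflowlogs/aws-region=us-east-1/year=2022/month=11/day=14/';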
Here is what my Glue table looks like where I am trying to use partition projection. It is from a CloudFormation template, so the base bucket name is a parameter, but this should be enough information to see where I am going wrong:
GlueTableVpcFlowParquet:
  Type: AWS::Glue::Table
  Properties:
    CatalogId:
      Ref: AWS::AccountId
    DatabaseName:
      Ref: GlueDatabase
    TableInput:
      Name:
        Fn::Join:
          - ''
          - - Ref: TableNameVPCFlowParquet
            - Ref: CIAppend
      TableType: EXTERNAL_TABLE
      Parameters:
        skip.header.line.count: '1'
        projection.enabled: 'true'
        projection.aws_account_id.type: injected
        projection.year.type: integer
        projection.year.range: 2022,9999
        projection.month.type: integer
        projection.month.range: 01,12
        # projection.month.digits: '2'
        projection.day.type: integer
        projection.day.range: 01,31
        projection.aws_region.type: enum
        projection.aws_region.values:
          Ref: TableRegions
        storage.location.template:
          Fn::Join:
            - ''
            - - Ref: TableLocationVpcFlowParquet
              - "/awslogs/aws-account-id=${aws_account_id}/aws-service=vpcflowlogs/aws-region=${aws_region}/year=${year}/month=${month}/day=${day}"
      PartitionKeys:
        - Name: aws_account_id
          Type: string
        - Name: aws_region
          Type: string
        - Name: year
          Type: int
        - Name: month
          Type: int
        - Name: day
          Type: int
      StorageDescriptor:
        Columns:
          - Name: version
            Type: int
          - Name: account_id
            Type: string
          - Name: interface_id
            Type: string
          - Name: srcaddr
            Type: string
          - Name: dstaddr
            Type: string
          - Name: srcport
            Type: int
          - Name: dstport
            Type: int
          - Name: protocol
            Type: bigint
          - Name: packets
            Type: bigint
          - Name: bytes
            Type: bigint
          - Name: start
            Type: bigint
          - Name: end
            Type: bigint
          - Name: action
            Type: string
          - Name: log_status
            Type: string
        Location:
          Fn::Join:
            - ''
            - - Ref: TableLocationVpcFlowParquet
              - ""
        InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
        BucketColumns: []
        SortColumns: []
Object prefixes in S3 look like this:
s3://<bucketName>/AWSLogs/aws-account-id=012345678912/aws-service=vpcflowlogs/aws-region=us-east-1/year=2022/month=11/day=10/012345678912_vpcflowlogs_us-east-1_fl-003e10087957dd87a_20221110T1850Z_33d78e4c.log.parquet
I run an Athena query like this:
SELECT * FROM "iaasdatasources"."vpcflowlogsparquet"
WHERE aws_account_id='012345678912' and aws_region='us-east-1' and year=2022 and month=11 and day=14
LIMIT 10
It runs for about 20 seconds but returns zero results and no errors.
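In case it helps with debugging this kind of zero-results situation, Athena's "$path" pseudo-column shows which S3 objects a query actually reads; combined with relaxing one predicate at a time, it can help narrow down which partition column's values aren't matching the real prefixes:

SELECT DISTINCT "$path"
FROM "iaasdatasources"."vpcflowlogsparquet"
WHERE aws_account_id='012345678912' AND aws_region='us-east-1'
AND year=2022 AND month=11 AND day=14
LIMIT 10;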
Also, if anyone has advice on using a date-type partition projection instead of separate year/month/day partitions, that would be helpful. I wasn't sure what the format should be, considering the date locations in the path are hive formatted: year=<year>/month=<month>/day=<day>.
Thanks for your response @chaitu. Related to what you mentioned about not having 'aws-service' as a partition: I added it as a partition and also got rid of the storage location template, since partition projection should recognize the column values automatically without being told where they are. With those changes I am now getting back data as expected (a rough sketch of the revised pieces follows after my format guess below)! But I am still unclear on how to format a date partition for this path layout. Would this be correct for the date type format with hive-compatible prefixes?
"projection.day.format" = "year=yyyy/month=MM/day=dd"
The projection format property can look like this (the patterns are alternative examples): projection.columnName.format = yyyyMM, dd-MM-yyyy, or dd-MM-yyyy-HH-mm-ss.
Note: this is a date format string based on the Java DateTimeFormatter, and it can be any supported java.time.* type.
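Applied to the hive-formatted paths in the question, a single date-type partition could look something like the sketch below. This is untested: it assumes DateTimeFormatter's rule that literal letters (the year=, month=, day= text) must be wrapped in single quotes, and it reintroduces a storage location template so the single day column stands in for the three path segments (bucket placeholder as in the question):

Parameters:
  projection.enabled: 'true'
  # aws_account_id / aws_service / aws_region projection settings as before
  projection.day.type: date
  projection.day.format: "'year='yyyy'/month='MM'/day='dd"
  # range values are written in the same custom format; NOW is the moving upper bound
  projection.day.range: year=2022/month=01/day=01,NOW
  storage.location.template: s3://<bucketName>/AWSLogs/aws-account-id=${aws_account_id}/aws-service=vpcflowlogs/aws-region=${aws_region}/${day}
PartitionKeys:
  - Name: aws_account_id
    Type: string
  - Name: aws_region
    Type: string
  - Name: day
    Type: string

Queries would then filter on the single day column with strings in that same format, e.g. day = 'year=2022/month=11/day=14', which is clunkier than three integer columns; so even if this works, the separate year/month/day projection may be easier to use.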