I have defined two tables:
CREATE EXTERNAL TABLE `event_data`(
`systemid` string COMMENT 'from deserializer',
`eventtime` string COMMENT 'from deserializer',
`eventtype` string COMMENT 'from deserializer',
`source` string COMMENT 'from deserializer',
`updtdate` string COMMENT 'from deserializer',
`rawdata` string COMMENT 'from deserializer',
`media` string COMMENT 'from deserializer')
PARTITIONED BY (
`partition_day` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://test-bucket/incoming/data/'
TBLPROPERTIES (
'classification'='json',
'projection.enabled'='true',
'projection.partition_day.format'='yyyy-MM-dd',
'projection.partition_day.range'='2010-01-01,NOW',
'projection.partition_day.type'='date',
'storage.location.template'='s3://test-bucket/incoming/data/${partition_day}/')
and
CREATE EXTERNAL TABLE `event_index`(
`systemid` string COMMENT 'from deserializer',
`eventtime` string COMMENT 'from deserializer',
`eventtype` string COMMENT 'from deserializer',
`source` string COMMENT 'from deserializer',
`updtdate` string COMMENT 'from deserializer',
`partition_day` string COMMENT 'from deserializer')
PARTITIONED BY (
`partition_year` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://test-bucket/incoming/index/'
TBLPROPERTIES (
'classification'='json',
'projection.enabled'='true',
'projection.partition_year.format'='yyyy',
'projection.partition_year.range'='2010,NOW',
'projection.partition_year.type'='date',
'storage.location.template'='s3://test-bucket/incoming/index/${partition_year}/')
then I inserted some data into partitions 2024-03-09 and 2024-03-11 and ran some simple test queries:
select * from event_data where partition_day = '2024-03-09'
select * from event_index where event_index.updtdate <= '2024-03-10'
SELECT * FROM event_index JOIN event_data USING(partition_day, systemid) where event_index.updtdate <= '2024-03-10'
By looking at the "Data scanned" metric when querying in the console, I can see clearly that partition pruning works fine with the single-table queries, but degrades to full table scan in the JOIN query.
Why is that (not) happening?
EDIT: if I simply swap the order of tables in the JOIN, the partition pruning happens as expected.
EDIT2: If I put the query with the swapped JOIN into a View, then the pruning fails again, so I'm back to where I started.