Injected projected partitions cannot participate in JOINs?

0

Is there any way around this?

SELECT * FROM event_data
JOIN event_index ON(
    event_data.systemid = event_index.systemid AND 
    event_data.partition_day = event_index.partition_day)
limit 10;

In the schema, partition_day is an injected partition in event_data. It's just a regular column in event_index.

Result:

CONSTRAINT_VIOLATION: Injected projected partition column partition_day must have only (and at least one) equality conditions in the WHERE clause! This query ran against the "test-xxxx" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: a78ce597-fcd8-469a-b402-f8a3b6e613a8

I've been struggling to try to force Athena to use partition pruning on partition_day on a very complex query, and this is just the latest failed attempt.

AlexR
asked a month ago514 views
2 Answers
0

UPDATED QUERY:

Hey AlexR, could you please try running this query?

WITH systemid_partition_day AS (
    SELECT systemid, partition_day
    FROM event_index
)
SELECT ed.*
FROM event_data ed
JOIN systemid_partition_day spd ON ed.systemid = spd.systemid
WHERE ed.partition_day = spd.partition_day
LIMIT 10;

This query joins event_data with event_index based on systemid and partition_day, filtering event_data to include only rows where the partition_day matches the partition_day associated with each systemid in event_index, limiting the result to 10 rows.

or

SELECT ed.*
FROM event_data ed
JOIN (
    SELECT systemid, partition_day
    FROM event_index
) spd ON ed.systemid = spd.systemid
WHERE ed.partition_day = spd.partition_day
LIMIT 10;

Sources:

profile picture
EXPERT
answered a month ago
  • Unfortunately this is not the same as the original query. The partition_day is not known in advance. The event_index table exists to map systemid to partition_day. The systemid is known at query time (it is an input to the query), but the partition_day is not. As an analogy to help you understand, think about an airport tracking the ID of traveler's wi-fi devices. When you want to query the logs for a given ID, you do not know in advance which days the device was at the airport. That is the purpose of the event_index table.

  • Hey AlexR, I've made some updates to the query. Could you give it a try and let me know if it works?

-1

Hi, I was looking at your earlier question on this topic too - i wonder how it behaves if you explicitly set the

event_data.partition_day=xxxxx event_index.partition_day=xxxxx as well as doing the join

It (should) be able to just work this out for itself as it's implicit by the join statement - but maybe being explicit helps it work out the more efficient pruned plan?

Cheers, Rich

AWS
answered a month ago
  • Setting the partition_day explicitly "works" (the error goes away) but is not the correct query: the event_index can contain more than one partition_day for the same given systemid. The event_data table contains 1000x more data than the event_index table, so it is critical that the event_index must be queried first, followed by reading only the relevant partitions from event_data. I can make this work by making two separate queries and setting the explicit partition_day in the second query. That's effectively moving the JOIN to application code. That's an anti-pattern, a form of technical debt that I would like to resolve.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions