Injected projected partitions cannot participate in JOINs?

0

Is there any way around this?

SELECT * FROM event_data
JOIN event_index ON(
    event_data.systemid = event_index.systemid AND 
    event_data.partition_day = event_index.partition_day)
limit 10;

In the schema, partition_day is an injected partition in event_data. It's just a regular column in event_index.

Result:

CONSTRAINT_VIOLATION: Injected projected partition column partition_day must have only (and at least one) equality conditions in the WHERE clause! This query ran against the "test-xxxx" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: a78ce597-fcd8-469a-b402-f8a3b6e613a8

I've been struggling to try to force Athena to use partition pruning on partition_day on a very complex query, and this is just the latest failed attempt.

AlexR
gefragt vor 2 Monaten520 Aufrufe
2 Antworten
0

UPDATED QUERY:

Hey AlexR, could you please try running this query?

WITH systemid_partition_day AS (
    SELECT systemid, partition_day
    FROM event_index
)
SELECT ed.*
FROM event_data ed
JOIN systemid_partition_day spd ON ed.systemid = spd.systemid
WHERE ed.partition_day = spd.partition_day
LIMIT 10;

This query joins event_data with event_index based on systemid and partition_day, filtering event_data to include only rows where the partition_day matches the partition_day associated with each systemid in event_index, limiting the result to 10 rows.

or

SELECT ed.*
FROM event_data ed
JOIN (
    SELECT systemid, partition_day
    FROM event_index
) spd ON ed.systemid = spd.systemid
WHERE ed.partition_day = spd.partition_day
LIMIT 10;

Sources:

profile picture
EXPERTE
beantwortet vor 2 Monaten
  • Unfortunately this is not the same as the original query. The partition_day is not known in advance. The event_index table exists to map systemid to partition_day. The systemid is known at query time (it is an input to the query), but the partition_day is not. As an analogy to help you understand, think about an airport tracking the ID of traveler's wi-fi devices. When you want to query the logs for a given ID, you do not know in advance which days the device was at the airport. That is the purpose of the event_index table.

  • Hey AlexR, I've made some updates to the query. Could you give it a try and let me know if it works?

-1

Hi, I was looking at your earlier question on this topic too - i wonder how it behaves if you explicitly set the

event_data.partition_day=xxxxx event_index.partition_day=xxxxx as well as doing the join

It (should) be able to just work this out for itself as it's implicit by the join statement - but maybe being explicit helps it work out the more efficient pruned plan?

Cheers, Rich

AWS
beantwortet vor 2 Monaten
  • Setting the partition_day explicitly "works" (the error goes away) but is not the correct query: the event_index can contain more than one partition_day for the same given systemid. The event_data table contains 1000x more data than the event_index table, so it is critical that the event_index must be queried first, followed by reading only the relevant partitions from event_data. I can make this work by making two separate queries and setting the explicit partition_day in the second query. That's effectively moving the JOIN to application code. That's an anti-pattern, a form of technical debt that I would like to resolve.

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen