accessing Athena pseudo columns in Glue ETL job custom query

0

background:

Users upload files to an S3 bucket containing their predictions for events under certain circumstances. I want my query results to only show the most recently made prediction for the circumstances.

Example: on 11/22/2023 a file is uploaded.

condition 1condition 2prediction
redblue12
yellowgreen4

then on 11/23/2023 another file is uploaded

condition 1condition 2prediction
redyellow9
purpleorange11
yellowgreen7

when I run my query I should see the following results

condition 1condition 2prediction
redblue12
redyellow9
purpleorange11
yellowgreen7

In Athena I have been able to accomplish this through the use of the pseudo column $file_modified_time. Which allows me to query the S3 object metadata telling me when the file containing each record was last updated.

However now I'm trying to run the query as a part of a Glue ETL job, so instead of using Presto like Athena, I have to use Spark SQL. My problem is Spark SQL doesn't recognize $file_modified_time as a column so I am trying to figure out if there is an alternative I can use? (unless I can just run an Athena query within a Glue ETL job and I just don't know it).

here is Presto's documentation on the pseudo columns if that is any help. https://prestodb.io/docs/current/connector/hive.html#extra-hidden-columns

has anybody else ran into this before and found a solution?

Thanks!

2 Answers
0
Accepted Answer

Just to add on to Gonzolo's response, Glue itself doesn’t provide direct functionality for retrieving the timestamp of when a file was saved in a S3 bucket. To retrieve such information, you can use Boto client methods to achieve this. Specifically you can use the:

  • head_object() method [1]
  • list_objects_v2() method [2]

Please see the external resource below to see example code of how this can be achieved.

Alternatively, you can save your Athena query and execute the saved query through a Glue job. Please see example code of this in this AWS Blog post [3].

References:

AWS
answered 5 months ago
0

It would be better if the predictions has a timestamp, instead of relying on the file modification date, which could be affected by other things.
Otherwise, I don't think there is a way in Glue/Spark but you could invoke Athena from Glue and read the results (it's a bit wasteful since the cluster would be waiting while Athena is running) and then ask Glue to read the Athena query results and continue from there.

profile pictureAWS
EXPERT
answered 5 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions