accessing Athena pseudo columns in Glue ETL job custom query

0

background:

Users upload files to an S3 bucket containing their predictions for events under certain circumstances. I want my query results to only show the most recently made prediction for the circumstances.

Example: on 11/22/2023 a file is uploaded.

condition 1condition 2prediction
redblue12
yellowgreen4

then on 11/23/2023 another file is uploaded

condition 1condition 2prediction
redyellow9
purpleorange11
yellowgreen7

when I run my query I should see the following results

condition 1condition 2prediction
redblue12
redyellow9
purpleorange11
yellowgreen7

In Athena I have been able to accomplish this through the use of the pseudo column $file_modified_time. Which allows me to query the S3 object metadata telling me when the file containing each record was last updated.

However now I'm trying to run the query as a part of a Glue ETL job, so instead of using Presto like Athena, I have to use Spark SQL. My problem is Spark SQL doesn't recognize $file_modified_time as a column so I am trying to figure out if there is an alternative I can use? (unless I can just run an Athena query within a Glue ETL job and I just don't know it).

here is Presto's documentation on the pseudo columns if that is any help. https://prestodb.io/docs/current/connector/hive.html#extra-hidden-columns

has anybody else ran into this before and found a solution?

Thanks!

2개 답변
0
수락된 답변

Just to add on to Gonzolo's response, Glue itself doesn’t provide direct functionality for retrieving the timestamp of when a file was saved in a S3 bucket. To retrieve such information, you can use Boto client methods to achieve this. Specifically you can use the:

  • head_object() method [1]
  • list_objects_v2() method [2]

Please see the external resource below to see example code of how this can be achieved.

Alternatively, you can save your Athena query and execute the saved query through a Glue job. Please see example code of this in this AWS Blog post [3].

References:

AWS
답변함 5달 전
0

It would be better if the predictions has a timestamp, instead of relying on the file modification date, which could be affected by other things.
Otherwise, I don't think there is a way in Glue/Spark but you could invoke Athena from Glue and read the results (it's a bit wasteful since the cluster would be waiting while Athena is running) and then ask Glue to read the Athena query results and continue from there.

profile pictureAWS
전문가
답변함 5달 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인