accessing Athena pseudo columns in Glue ETL job custom query

0

background:

Users upload files to an S3 bucket containing their predictions for events under certain circumstances. I want my query results to only show the most recently made prediction for the circumstances.

Example: on 11/22/2023 a file is uploaded.

condition 1condition 2prediction
redblue12
yellowgreen4

then on 11/23/2023 another file is uploaded

condition 1condition 2prediction
redyellow9
purpleorange11
yellowgreen7

when I run my query I should see the following results

condition 1condition 2prediction
redblue12
redyellow9
purpleorange11
yellowgreen7

In Athena I have been able to accomplish this through the use of the pseudo column $file_modified_time. Which allows me to query the S3 object metadata telling me when the file containing each record was last updated.

However now I'm trying to run the query as a part of a Glue ETL job, so instead of using Presto like Athena, I have to use Spark SQL. My problem is Spark SQL doesn't recognize $file_modified_time as a column so I am trying to figure out if there is an alternative I can use? (unless I can just run an Athena query within a Glue ETL job and I just don't know it).

here is Presto's documentation on the pseudo columns if that is any help. https://prestodb.io/docs/current/connector/hive.html#extra-hidden-columns

has anybody else ran into this before and found a solution?

Thanks!

2回答
0
承認された回答

Just to add on to Gonzolo's response, Glue itself doesn’t provide direct functionality for retrieving the timestamp of when a file was saved in a S3 bucket. To retrieve such information, you can use Boto client methods to achieve this. Specifically you can use the:

  • head_object() method [1]
  • list_objects_v2() method [2]

Please see the external resource below to see example code of how this can be achieved.

Alternatively, you can save your Athena query and execute the saved query through a Glue job. Please see example code of this in this AWS Blog post [3].

References:

AWS
回答済み 5ヶ月前
0

It would be better if the predictions has a timestamp, instead of relying on the file modification date, which could be affected by other things.
Otherwise, I don't think there is a way in Glue/Spark but you could invoke Athena from Glue and read the results (it's a bit wasteful since the cluster would be waiting while Athena is running) and then ask Glue to read the Athena query results and continue from there.

profile pictureAWS
エキスパート
回答済み 5ヶ月前

ログインしていません。 ログイン 回答を投稿する。

優れた回答とは、質問に明確に答え、建設的なフィードバックを提供し、質問者の専門分野におけるスキルの向上を促すものです。

質問に答えるためのガイドライン

関連するコンテンツ