How can I "automatically" add new partitions to a Glue table based on a Hive formatted S3 bucket?

0

I have a Bucket containing AWS AppStream logs on format s3://appstream-logs.../sessions/schedule=DAILY/year=2024/month=04/day=03/daily-session-report-2024-04-03.csv. I have made this data available in Athena using a Glue table with a "hard coded" schema. (Schema is indeed static, not crawler should be necessary.)

Each day a new partition is created (first time a user starts an AppStream session), and I would like the new session log to be available in Athena as soon as possible. The log file is then appended throughout the day. I have found so many options for achieving this, yet nothing directly of-the-shelf that I can use. I would expect this to be so common that a solution already exists.

I think that one option is to write a lambda function triggered either directly via S3 events (?) or via Amazon EventBridge events (?), and that the Lambda can either do MSCK REPAIR TABLE (somehow from Python) or use boto3 client to do create_partition (with a a ton of input parmeters that I find somewhat poorly documented). Another option seams to be to use an Incremental Crawl, but I believe that is not needed due to my static schema. I also seams to have an option do to this using Glue Jobs where I may be able to write a similar code as in my Lambda Function.

Does anyone have any suggestions what is the best way to achieve this? Is there already perhaps a hosted solution for this on AWS, or someone knows some open source code that can help me achieving this? Mostly I would need help to setup a strategy/architecture on how to do this. I have about four more (AppStream) sources very similar to this with applications logs, event logs etc.

Andreax
posta 3 mesi fa227 visualizzazioni
2 Risposte
1
Risposta accettata

If your only client is Athena and the partitions predictable, you could use Athena projections.
In general, the code that ingests the data should update the table to register the partitions added, if you write using Spark DataFrame or DynamicFrame.
https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.saveAsTable.html#pyspark.sql.DataFrameWriter.saveAsTable

profile pictureAWS
ESPERTO
con risposta 3 mesi fa
  • Thank you, Athena projections worked out really nice! Sorry for late reply, took some time to prioritise and verify. :-)

0

AFAIK, crawler is not just to update schema but about to update partitions which is what you are looking for. A daily schedule to run crawler to detect new folders should suffice. If the new folder (daily log) might end up being created any time of the day, you might look at using S3 events to trigger the crawler.

Maybe this blog might be useful

--Syd

profile picture
Syd
con risposta 3 mesi fa
profile picture
ESPERTO
verificato 3 mesi fa
  • A crawler CAN update the partitions, but it does not seam to be necessary, there are at least two other ways to update partitions on HIVE formatted S3 buckets, MSCK REPAIR TABLE and glue.client. create_partition. I just fint it odd that there is not a default way to do it. in GCP it's basically a boolean switch, "Auto add new partitions" and that's it...

Accesso non effettuato. Accedi per postare una risposta.

Una buona risposta soddisfa chiaramente la domanda, fornisce un feedback costruttivo e incoraggia la crescita professionale del richiedente.

Linee guida per rispondere alle domande