How can I "automatically" add new partitions to a Glue table based on a Hive formatted S3 bucket?

0

I have a bucket containing AWS AppStream logs in the format s3://appstream-logs.../sessions/schedule=DAILY/year=2024/month=04/day=03/daily-session-report-2024-04-03.csv. I have made this data available in Athena using a Glue table with a "hard-coded" schema. (The schema is indeed static, so no crawler should be necessary.)

Each day a new partition is created (the first time a user starts an AppStream session), and I would like the new session log to be available in Athena as soon as possible. The log file is then appended throughout the day. I have found many options for achieving this, yet nothing off-the-shelf that I can use directly. I would expect this to be so common that a solution already exists.

I think that one option is to write a Lambda function triggered either directly via S3 events (?) or via Amazon EventBridge events (?), and the Lambda can either run MSCK REPAIR TABLE (somehow from Python) or use the boto3 client to call create_partition (with a ton of input parameters that I find somewhat poorly documented). Another option seems to be an incremental crawl, but I believe that is not needed due to my static schema. There also seems to be an option to do this with Glue Jobs, where I may be able to write code similar to my Lambda function.
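For what it's worth, a minimal sketch of the Lambda route described above, assuming an S3 ObjectCreated trigger and the Glue create_partition call. The database name, table name, and the bare-bones StorageDescriptor are placeholders; in practice you would copy the StorageDescriptor from your existing table (via get_table) and only change the Location:

```python
import re

# Parse the Hive-style key into partition values. Pure string work,
# so this part can be exercised without any AWS access.
KEY_RE = re.compile(
    r"sessions/schedule=(?P<schedule>[^/]+)/year=(?P<year>\d{4})"
    r"/month=(?P<month>\d{2})/day=(?P<day>\d{2})/"
)

def partition_values(key):
    """Return [schedule, year, month, day] from an S3 key, or None."""
    m = KEY_RE.search(key)
    if not m:
        return None
    return [m["schedule"], m["year"], m["month"], m["day"]]

def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime
    glue = boto3.client("glue")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        values = partition_values(key)
        if values is None:
            continue
        prefix = key.rsplit("/", 1)[0]
        try:
            glue.create_partition(
                DatabaseName="appstream_logs",   # placeholder
                TableName="sessions",            # placeholder
                PartitionInput={
                    "Values": values,
                    "StorageDescriptor": {
                        # Placeholder: copy the table's StorageDescriptor
                        # and override only the Location.
                        "Location": f"s3://{bucket}/{prefix}/",
                    },
                },
            )
        except glue.exceptions.AlreadyExistsException:
            # Appends later in the day hit the same partition; ignore.
            pass
```

Since the file is appended throughout the day, the AlreadyExistsException branch matters: only the first object of the day actually registers a partition.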

Does anyone have suggestions on the best way to achieve this? Is there perhaps a hosted solution for this on AWS, or does someone know of open-source code that could help me achieve this? Mostly I need help setting up a strategy/architecture for how to do this. I have about four more (AppStream) sources very similar to this, with application logs, event logs, etc.

Andreax
asked 3 months ago · 227 views
2 Answers
1
Accepted Answer

If your only client is Athena and the partitions are predictable, you could use Athena partition projection.
In general, the code that ingests the data should update the table to register the partitions it adds, e.g. if you write using a Spark DataFrame or a Glue DynamicFrame.
https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.saveAsTable.html#pyspark.sql.DataFrameWriter.saveAsTable
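For reference, partition projection for a layout like the one in the question is configured through table properties. Below is a sketch of the ALTER TABLE statement, built in Python so the individual properties are easy to see; the table name, bucket name, and year range are assumptions to adapt:

```python
# Athena partition projection properties for a
# schedule=.../year=.../month=.../day=... layout. Names are placeholders.
props = {
    "projection.enabled": "true",
    "projection.schedule.type": "enum",
    "projection.schedule.values": "DAILY",
    "projection.year.type": "integer",
    "projection.year.range": "2023,2033",   # assumed range
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",         # zero-padded, matches month=04
    "projection.day.type": "integer",
    "projection.day.range": "1,31",
    "projection.day.digits": "2",
    "storage.location.template":
        "s3://appstream-logs-example/sessions/schedule=${schedule}"
        "/year=${year}/month=${month}/day=${day}/",
}

ddl = "ALTER TABLE sessions SET TBLPROPERTIES (\n  " + ",\n  ".join(
    f"'{k}' = '{v}'" for k, v in props.items()
) + "\n)"
print(ddl)
```

Once projection is enabled, Athena computes partition locations at query time, so no Lambda, crawler, or MSCK REPAIR TABLE is needed at all.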

AWS
EXPERT
answered 3 months ago
  • Thank you, Athena projections worked out really nicely! Sorry for the late reply, it took some time to prioritise and verify. :-)

0

AFAIK, a crawler is not just for updating the schema; it also updates partitions, which is what you are looking for. A daily schedule to run the crawler to detect new folders should suffice. If the new folder (daily log) might end up being created at any time of day, you might look at using S3 events to trigger the crawler.
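A sketch of the S3-events variant described above, assuming a crawler named appstream-sessions-crawler already exists; the prefix filter keeps unrelated objects from triggering runs:

```python
def should_crawl(key):
    """Only react to the daily session reports, not other objects."""
    return key.startswith("sessions/") and key.endswith(".csv")

def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime
    glue = boto3.client("glue")
    keys = [r["s3"]["object"]["key"] for r in event["Records"]]
    if any(should_crawl(k) for k in keys):
        try:
            glue.start_crawler(Name="appstream-sessions-crawler")  # placeholder
        except glue.exceptions.CrawlerRunningException:
            # A run is already in progress; it will pick up the new file.
            pass
```

Because the log is appended all day, the CrawlerRunningException guard avoids failing on the frequent events that arrive while a crawl is still running.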

This blog might be useful

--Syd

Syd
answered 3 months ago
EXPERT
reviewed 3 months ago
  • A crawler CAN update the partitions, but it does not seem to be necessary; there are at least two other ways to update partitions on Hive-formatted S3 buckets: MSCK REPAIR TABLE and the Glue client's create_partition. I just find it odd that there is no default way to do it. In GCP it's basically a boolean switch, "Auto add new partitions", and that's it...
