If your only client is Athena and the partitions are predictable, you could use Athena partition projection.
In general, the code that ingests the data should update the table to register the partitions it adds; this is supported when you write using a Spark DataFrame or a Glue DynamicFrame:
https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.saveAsTable.html#pyspark.sql.DataFrameWriter.saveAsTable
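The "update from job" pattern in the first link boils down to writing through a Glue sink with enableUpdateCatalog set, so new partitions are registered as data lands. A minimal sketch (the bucket, database, and table names are placeholders, and the glueContext calls are shown as comments since they only exist inside a Glue job):

```python
# Hedged sketch of the Glue "update catalog from job" pattern.
# my-bucket / my_db / logs are illustrative names, not from the thread.

def catalog_sink_options(path, partition_keys):
    """Sink options that tell Glue to register new partitions in the
    Data Catalog while writing (enableUpdateCatalog=True)."""
    return {
        "connection_type": "s3",
        "path": path,
        "enableUpdateCatalog": True,
        "partitionKeys": partition_keys,
    }

opts = catalog_sink_options("s3://my-bucket/logs/", ["year", "month", "day"])

# Inside a Glue job (where glueContext exists) you would then do roughly:
#
#   sink = glueContext.getSink(**opts)
#   sink.setCatalogInfo(catalogDatabase="my_db", catalogTableName="logs")
#   sink.setFormat("glueparquet")
#   sink.writeFrame(dynamic_frame)  # partitions registered as data is written
```

With this in place no crawler run is needed for partition discovery; the write itself keeps the catalog current.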
AFAIK, the crawler is not just for updating the schema but also for updating partitions, which is what you are looking for. A daily scheduled crawler run to detect new folders should suffice. If the new folder (daily log) might end up being created at any time of the day, you might look at using S3 events to trigger the crawler.
Maybe this blog might be useful
--Syd
A crawler CAN update the partitions, but it does not seem to be necessary; there are at least two other ways to update partitions on Hive-partitioned S3 buckets: MSCK REPAIR TABLE, and the Glue client's create_partition API. I just find it odd that there is no default way to do it. In GCP it's basically a boolean switch, "Auto add new partitions", and that's it...
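For reference, the create_partition route looks roughly like this with boto3. The database/table names, the dt=YYYY-MM-DD layout, and the Parquet SerDe are assumptions for illustration; the actual AWS call is left commented out so the sketch runs without credentials:

```python
# Hedged sketch: registering one daily Hive-style partition via the
# Glue API. my_db / logs / s3://my-bucket/logs are placeholder names.
from datetime import date

def partition_input(table_location, day):
    """Build the PartitionInput dict for a dt=YYYY-MM-DD partition,
    assuming Parquet data in a Hive-style folder layout."""
    value = day.isoformat()
    return {
        "Values": [value],
        "StorageDescriptor": {
            "Location": f"{table_location}/dt={value}/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }

part = partition_input("s3://my-bucket/logs", date(2023, 5, 1))

# With AWS credentials configured, the registration itself would be:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_partition(DatabaseName="my_db", TableName="logs",
#                         PartitionInput=part)
```

MSCK REPAIR TABLE is simpler to issue but scans the whole table location, so the targeted create_partition call tends to scale better as partitions accumulate.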
Thank you, Athena projections worked out really nicely! Sorry for the late reply; it took some time to prioritise and verify. :-)
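For anyone landing here later: partition projection is configured entirely through table properties, so no partition registration is ever needed. A hedged sketch of such a DDL (table name, columns, date range, and the dt= layout are all illustrative assumptions, not from this thread):

```sql
CREATE EXTERNAL TABLE logs (
  request_id string,
  message    string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/logs/'
TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.dt.type'        = 'date',
  'projection.dt.format'      = 'yyyy-MM-dd',
  'projection.dt.range'       = '2020-01-01,NOW',
  'storage.location.template' = 's3://my-bucket/logs/dt=${dt}/'
);
```

Athena computes the partition values from these properties at query time instead of reading them from the catalog, which is why this only helps when Athena is the sole client and the partition scheme is predictable.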