If your only client is Athena and the partitions are predictable, you could use Athena partition projection.
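As a sketch of what that looks like: with partition projection, Athena computes the partitions from table properties instead of reading them from the catalog, so nothing has to register new folders at all. Bucket, table, and column names below are made up for illustration:

```sql
-- Hypothetical daily-log table; Athena derives dt values from the
-- projection properties, so new dt=YYYY-MM-DD folders are queryable
-- immediately with no crawler, MSCK, or create_partition call.
CREATE EXTERNAL TABLE daily_logs (
  message string
)
PARTITIONED BY (dt string)
LOCATION 's3://my-bucket/logs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.range' = '2023-01-01,NOW',
  'storage.location.template' = 's3://my-bucket/logs/dt=${dt}/'
);
```

The trade-off is that projected partitions are only visible to Athena, not to other catalog clients.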
In general, the code that ingests the data should update the table to register the new partitions as they are added; this works whether you write using a Spark DataFrame or a Glue DynamicFrame.
https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.saveAsTable.html#pyspark.sql.DataFrameWriter.saveAsTable
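A hedged sketch of the DynamicFrame pattern from the first link (database, table, path, and partition key are placeholders; this only runs inside an AWS Glue job, where `glueContext` and the `awsglue` libraries exist):

```python
# Sketch: write a DynamicFrame and update the Data Catalog in the same job.
# Runs only in the AWS Glue runtime; names and paths here are made up.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-bucket/logs/",
    enableUpdateCatalog=True,           # register new partitions on write
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["dt"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="my_db", catalogTableName="daily_logs")
sink.writeFrame(transformed_dynamic_frame)  # your job's output DynamicFrame
```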
AFAIK, a crawler is not just for updating the schema; it also updates partitions, which is what you are looking for. A daily scheduled crawler run to detect new folders should suffice. If the new folder (daily log) might be created at any time of day, you could look at using S3 events to trigger the crawler.
This blog might be useful
--Syd
A crawler CAN update the partitions, but it does not seem to be necessary; there are at least two other ways to update partitions on Hive-formatted S3 buckets: `MSCK REPAIR TABLE` and `glue.client.create_partition`. I just find it odd that there is no default way to do it. In GCP it's basically a boolean switch, "Auto add new partitions", and that's it...
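For the second of those, a hedged sketch of registering one daily partition via boto3's Glue client (bucket, database, table, and the `dt=` layout are assumptions for illustration; the helper function is mine, not an AWS API):

```python
from datetime import date


def partition_input(table_location: str, day: date) -> dict:
    """Build the PartitionInput dict for a Hive-style dt=YYYY-MM-DD folder.

    Assumes plain-text data under <table_location>/dt=<date>/ ;
    adjust formats/SerDe to match your actual table definition.
    """
    dt = day.isoformat()
    return {
        "Values": [dt],
        "StorageDescriptor": {
            "Location": f"{table_location}/dt={dt}/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io."
                            "HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
            },
        },
    }


def register_partition(database: str, table: str,
                       table_location: str, day: date) -> None:
    """Register one new daily partition in the Glue Data Catalog."""
    import boto3  # imported here so partition_input stays testable offline
    glue = boto3.client("glue")
    glue.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput=partition_input(table_location, day),
    )
```

You would call `register_partition("my_db", "daily_logs", "s3://my-bucket/logs", date.today())` from the same job (or Lambda) that writes the day's folder.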
Thank you, Athena projections worked out really nice! Sorry for late reply, took some time to prioritise and verify. :-)