Based on my research, you can schedule AWS Glue jobs from the console using cron expressions. This allows you to define a time-based schedule for your jobs with a minimum precision of 5 minutes. For instance, you could set up a job to run every day at a specific time, or at intervals throughout the day.
https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html
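As a rough illustration of what such a schedule looks like, here is a small helper that builds a daily Glue cron expression. The function name glue_daily_cron is made up for this example; the field order (minutes, hours, day-of-month, month, day-of-week, year) follows the Glue schedule documentation linked above, and times are in UTC.

```python
def glue_daily_cron(hour, minute):
    """Return a Glue schedule expression that fires once a day at hour:minute UTC.

    Hypothetical helper for illustration; it only formats the string.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Field order: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour} * * ? *)"


# Every day at 12:15 UTC
print(glue_daily_cron(12, 15))
```

You would paste the resulting string into the schedule field in the console, or pass it as the Schedule value when creating a scheduled trigger.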
For incremental data processing, AWS Glue provides a feature called "job bookmarks". This feature allows AWS Glue to keep track of data that has already been processed, thereby preventing the reprocessing of old data during subsequent job runs. You can enable job bookmarks when setting up your AWS Glue job, and use them in combination with a "transformation context" parameter in your ETL code to manage state information for incremental data sets. This way, only the new data that has arrived since the last job run will be processed.
Incremental data quality validation with Glue Data Quality on the Data Catalog is possible using partition pushdown predicates. As an example, if your data is partitioned by day and month, you can enter '(day=01 or day=02) and month=07' in the partition predicate field that you mention in your question. In this case, data quality is validated only for the first and second day of July.
However, this field is static and does not accept a parameter such as 'previous 7 days'. To get that behavior, instead of relying on a single scheduled job that runs repeatedly with the same predicate, you need to start each run on demand after updating the partition predicate. To automate this, you can deploy a cron schedule that triggers a Lambda function, which starts the Data Quality ruleset run with the right partition predicate at the right time. You also need to keep track of the runs yourself.
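A minimal sketch of that Lambda approach, under some assumptions: the table is partitioned by string columns named year, month and day with zero-padded values, and the database, table, ruleset and role names below are placeholders. The helpers last_n_days_predicate and build_run_request are hypothetical names for this example. My understanding is that the GlueTable data source of start_data_quality_ruleset_evaluation_run accepts a pushDownPredicate key in AdditionalOptions, but please verify that against the current API reference before relying on it.

```python
from datetime import date, timedelta


def last_n_days_predicate(n_days, today=None):
    """Build a predicate matching the n_days full days before `today`.

    Assumes string partitions year/month/day with zero-padded values.
    """
    today = today or date.today()
    clauses = []
    for offset in range(1, n_days + 1):
        d = today - timedelta(days=offset)
        clauses.append(f"(year='{d:%Y}' and month='{d:%m}' and day='{d:%d}')")
    return " or ".join(clauses)


def build_run_request(database, table, ruleset, role_arn, predicate):
    """Assemble keyword arguments for start_data_quality_ruleset_evaluation_run."""
    return {
        "DataSource": {
            "GlueTable": {
                "DatabaseName": database,
                "TableName": table,
                # Assumed option key; check the API reference.
                "AdditionalOptions": {"pushDownPredicate": predicate},
            }
        },
        "Role": role_arn,
        "RulesetNames": [ruleset],
    }


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without boto3.
    import boto3

    request = build_run_request(
        database="my_database",                               # placeholder
        table="my_table",                                     # placeholder
        ruleset="my_ruleset",                                 # placeholder
        role_arn="arn:aws:iam::123456789012:role/MyGlueRole",  # placeholder
        predicate=last_n_days_predicate(7),
    )
    response = boto3.client("glue").start_data_quality_ruleset_evaluation_run(**request)
    # Store the run ID somewhere if you need to track runs.
    return response["RunId"]
```

Including the year in the predicate keeps the filter from matching the same day and month in earlier years.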
Please note that you can run data quality checks inside Glue ETL jobs (this is called proactive data quality) as well as through Glue Data Quality on the Data Catalog (the method you have mentioned). If you use the ETL job method, you can benefit from job bookmarks. For reference, see: https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html
If set to day=02 and month=07, wouldn't the scheduled run always take into account the same set of data from all the years where data is from second day of July? Based on your answer Glue Data Catalog Quality seems like a pretty useless service and I have no idea why it supports scheduling if it is not even possible to have dynamic filtering.
To filter a specific number of days' worth of data in AWS Glue, you can convert the DynamicFrame to a Spark DataFrame and filter it with PySpark's DataFrame API. If your data includes a timestamp or date column, you can use the where function to filter on that column. For example, to keep only the previous day's data, assuming a column named "time":
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F
# Convert to a Spark DataFrame so the DataFrame API is available
df = Transform1.toDF()
# Cast "time" to a date so timestamps from any hour of yesterday match
df = df.where(F.to_date(F.col("time")) == F.date_sub(F.current_date(), 1))
Transform2 = DynamicFrame.fromDF(df, glue_ctx=glueContext, name="df")
https://stackoverflow.com/questions/68505806/aws-glue-filter-date-field
Regarding your question about "Catalog partition predicate": AWS Glue lets you use partition predicates to filter directly on the partition metadata in the Data Catalog, without having to list and read all the files in your dataset. This can save a significant amount of processing time if you only need a small subset of your data. The predicate can be any Boolean expression supported by Spark SQL. For instance, in a Python job, push_down_predicate = "(year=='2017' and month=='04')" loads only the partitions in the Data Catalog that have both year equal to 2017 and month equal to 04.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
There's also an option for server-side partition pruning, catalogPartitionPredicate, which helps when a table has so many partitions that listing them all from the catalog adds noticeable overhead. This option uses partition indexes in the AWS Glue Data Catalog and can make partition filtering much faster when you have millions of partitions in one table. You can use push_down_predicate together with catalogPartitionPredicate (passed in additional_options) when the filter you need requires predicate syntax that catalog partition indexes do not yet support.
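To show how the two predicates fit together in an ETL script, here is a sketch. The function name load_partitions and the predicate values are only examples; glue_context is expected to be the GlueContext your job already creates, and the call is to create_dynamic_frame.from_catalog as described in the partitions documentation linked above.

```python
def load_partitions(glue_context, database, table_name):
    """Read a catalog table using both kinds of partition filtering.

    glue_context is the GlueContext from your job; the predicate values
    below are illustrative and should match your own partition columns.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        # Spark SQL expression, evaluated after partitions are listed
        push_down_predicate="(year == '2017' and month == '04')",
        # Server-side pruning in the Data Catalog using partition indexes
        additional_options={"catalogPartitionPredicate": "year = '2017'"},
    )
```

In a Glue job you would call it as load_partitions(glueContext, "my_database", "my_table"), where both names are placeholders.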
This answers none of my issues. I know how to set a schedule but I don't know how to apply a filter to only take x days worth of data into consideration. I don't see any option for "job bookmarks" for data quality rulesets. Maybe you are talking about some other service. As far as I know there are only options for "Filter data at source" and "Catalog partition predicate" for which I don't know what kind of syntax it can accept.