When running an AWS Glue Data Quality evaluation, how can I filter the data it is run on?


I'd like to run evaluations of my data quality rulesets on single partitions of my table rather than the whole table. This is because for most of my tables each partition effectively represents a snapshot of the data and running the checks only makes sense in the context of a single partition. Is there a way to filter or subset the data that a ruleset is evaluated on?

Preferably I'd like to do this when triggering the evaluation, but defining a restriction in the DQDL rules might also work.

asked 2 years ago, 300 views
1 Answer

The start_data_quality_ruleset_evaluation_run API lets you pass a pushDownPredicate in the AdditionalOptions of the GlueTable data source. You can pass a catalogPartitionPredicate the same way.

aws glue start-data-quality-ruleset-evaluation-run --ruleset-names ruleset1 --region us-east-2 --data-source '{"GlueTable":{"DatabaseName": "db", "TableName":"tb", "AdditionalOptions": {"pushDownPredicate": "year=\"2022\""}}}' --role Admin

A similar API structure can be used to get rule recommendations on the filtered data as well.
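For anyone triggering this from Python rather than the CLI, here is a minimal boto3 sketch of the same idea. The helper names (build_data_source, start_partition_evaluation, start_partition_recommendation) are made up for illustration, and the database, table, role and ruleset values are just the placeholders from the command above. I have not run this against a real account, so treat it as a starting point.

```python
ALLOWED_OPTIONS = ("pushDownPredicate", "catalogPartitionPredicate")


def build_data_source(database, table, predicate, option="pushDownPredicate"):
    """Build a DataSource argument restricted to the partitions matching `predicate`."""
    if option not in ALLOWED_OPTIONS:
        raise ValueError(f"unsupported option: {option}")
    return {
        "GlueTable": {
            "DatabaseName": database,
            "TableName": table,
            "AdditionalOptions": {option: predicate},
        }
    }


def start_partition_evaluation(ruleset_name, role, data_source, region="us-east-2"):
    """Start a ruleset evaluation run on the filtered data and return its run ID."""
    import boto3  # imported here so build_data_source works without boto3 installed

    glue = boto3.client("glue", region_name=region)
    response = glue.start_data_quality_ruleset_evaluation_run(
        DataSource=data_source,
        Role=role,
        RulesetNames=[ruleset_name],
    )
    return response["RunId"]


def start_partition_recommendation(role, data_source, region="us-east-2"):
    """Start a rule recommendation run on the same filtered data source."""
    import boto3

    glue = boto3.client("glue", region_name=region)
    response = glue.start_data_quality_rule_recommendation_run(
        DataSource=data_source,
        Role=role,
    )
    return response["RunId"]


# Same placeholder values as the CLI command above.
data_source = build_data_source("db", "tb", 'year="2022"')
```

With credentials configured, start_partition_evaluation("ruleset1", "Admin", data_source) should be equivalent to the CLI call. Building the data source separately lets you loop over partition values and start one run per snapshot.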

answered 2 years ago
