When running an AWS Glue Data Quality evaluation, how can I filter the data it is run on?


I'd like to run evaluations of my data quality rulesets on single partitions of my table rather than the whole table. This is because for most of my tables each partition effectively represents a snapshot of the data and running the checks only makes sense in the context of a single partition. Is there a way to filter or subset the data that a ruleset is evaluated on?

Preferably I'd like to do this when triggering the evaluation, but defining a restriction in the DQDL rules might also work.

Asked 1 year ago, 228 views
1 answer

The start_data_quality_ruleset_evaluation_run API lets you pass a pushDownPredicate in the AdditionalOptions of the GlueTable data source. You can pass a catalogPartitionPredicate the same way.

aws glue start-data-quality-ruleset-evaluation-run --ruleset-names ruleset1 --region us-east-2 --data-source '{"GlueTable":{"DatabaseName": "db", "TableName":"tb", "AdditionalOptions": {"pushDownPredicate": "year=\"2022\""}}}' --role Admin

A similar API structure can be used to get rule recommendations on the filtered data as well.
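
If you trigger the run from Python rather than the CLI, the same request could be sketched with boto3 roughly as follows. This is only a sketch under assumptions: the helper names build_evaluation_request and start_partition_evaluation are made up for illustration, and the database, table, ruleset, role, and region values are simply copied from the CLI example above.

```python
import json


def build_evaluation_request(database, table, ruleset_names, role, predicate):
    """Build keyword arguments for start_data_quality_ruleset_evaluation_run.

    Hypothetical helper: it only assembles the request dictionary, mirroring
    the structure of the CLI call above.
    """
    return {
        "DataSource": {
            "GlueTable": {
                "DatabaseName": database,
                "TableName": table,
                # Restrict the evaluation to the matching partition(s).
                "AdditionalOptions": {"pushDownPredicate": predicate},
            }
        },
        "Role": role,
        "RulesetNames": list(ruleset_names),
    }


def start_partition_evaluation(request, region="us-east-2"):
    """Start the evaluation run and return its run ID (needs AWS credentials)."""
    import boto3  # imported lazily so the request can be built without AWS access

    glue = boto3.client("glue", region_name=region)
    response = glue.start_data_quality_ruleset_evaluation_run(**request)
    return response["RunId"]


# Same values as the CLI example: evaluate only the year="2022" partition.
request = build_evaluation_request("db", "tb", ["ruleset1"], "Admin", 'year="2022"')
print(json.dumps(request, indent=2))
```

Calling start_partition_evaluation(request) would then submit the run. To evaluate a different snapshot, only the predicate string needs to change.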

AWS
Answered 1 year ago
