When running an AWS Glue Data Quality evaluation, how can I filter the data it is run on?

I'd like to run evaluations of my data quality rulesets on single partitions of my table rather than the whole table. This is because for most of my tables each partition effectively represents a snapshot of the data and running the checks only makes sense in the context of a single partition. Is there a way to filter or subset the data that a ruleset is evaluated on?

Preferably I'd like to do this when triggering the evaluation, but defining a restriction in the DQDL rules might also work.

Asked 1 year ago · 228 views
1 Answer
The start_data_quality_ruleset_evaluation_run API lets you pass a pushDownPredicate in the AdditionalOptions of the GlueTable data source. You can pass a catalogPartitionPredicate the same way.

aws glue start-data-quality-ruleset-evaluation-run --ruleset-names ruleset1 --region us-east-2 --data-source '{"GlueTable":{"DatabaseName": "db", "TableName":"tb", "AdditionalOptions": {"pushDownPredicate": "year=\"2022\""}}}' --role Admin
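For anyone triggering the run from Python rather than the CLI, here is a minimal boto3 sketch of the same call. The helper names `build_evaluation_request` and `start_partition_evaluation` are made up for illustration, and the database, table, ruleset, and role values are just the placeholders from the command above.

```python
def build_evaluation_request(database, table, ruleset_names, role, predicate):
    """Build the keyword arguments for start_data_quality_ruleset_evaluation_run.

    The predicate is passed through AdditionalOptions, so only matching
    partitions are read when the ruleset is evaluated.
    """
    return {
        "DataSource": {
            "GlueTable": {
                "DatabaseName": database,
                "TableName": table,
                "AdditionalOptions": {"pushDownPredicate": predicate},
            }
        },
        "Role": role,
        "RulesetNames": list(ruleset_names),
    }


def start_partition_evaluation(request, region="us-east-2"):
    """Start the evaluation run and return its run ID (needs AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3

    glue = boto3.client("glue", region_name=region)
    response = glue.start_data_quality_ruleset_evaluation_run(**request)
    return response["RunId"]


# Same placeholder values as the CLI example: one partition, year="2022".
request = build_evaluation_request(
    database="db",
    table="tb",
    ruleset_names=["ruleset1"],
    role="Admin",
    predicate='year="2022"',
)
```

Calling `start_partition_evaluation(request)` would then submit the run. Building the request separately makes it easy to loop over partitions and change only the predicate.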

A similar request structure can be used to get rule recommendations on the filtered data as well.
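As a rough sketch of the recommendation case, the same data source structure can be handed to boto3's `start_data_quality_rule_recommendation_run`. The helper names below are hypothetical, and the values are again the placeholders used in this answer.

```python
def build_recommendation_request(database, table, role, predicate):
    """Build keyword arguments for start_data_quality_rule_recommendation_run.

    The DataSource has the same shape as for an evaluation run, so the
    partition filter goes in AdditionalOptions in the same way.
    """
    return {
        "DataSource": {
            "GlueTable": {
                "DatabaseName": database,
                "TableName": table,
                "AdditionalOptions": {"pushDownPredicate": predicate},
            }
        },
        "Role": role,
    }


def start_partition_recommendation(rec_request, region="us-east-2"):
    """Start the recommendation run and return its run ID (needs AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3

    glue = boto3.client("glue", region_name=region)
    response = glue.start_data_quality_rule_recommendation_run(**rec_request)
    return response["RunId"]


# Placeholder values: recommend rules from the year="2022" partition only.
rec_request = build_recommendation_request("db", "tb", "Admin", 'year="2022"')
```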

AWS
Answered 1 year ago
