When running an AWS Glue Data Quality evaluation, how can I filter the data it is run on?

I'd like to run evaluations of my data quality rulesets on single partitions of my table rather than the whole table. This is because for most of my tables each partition effectively represents a snapshot of the data and running the checks only makes sense in the context of a single partition. Is there a way to filter or subset the data that a ruleset is evaluated on?

Preferably I'd like to do this when triggering the evaluation, but defining a restriction in the DQDL rules might also work.

Asked a year ago · 228 views
1 Answer
The start_data_quality_ruleset_evaluation_run API lets you pass a pushDownPredicate in the AdditionalOptions of the GlueTable data source. You can pass a catalogPartitionPredicate in the same way.

aws glue start-data-quality-ruleset-evaluation-run --ruleset-names ruleset1 --region us-east-2 --data-source '{"GlueTable":{"DatabaseName": "db", "TableName":"tb", "AdditionalOptions": {"pushDownPredicate": "year=\"2022\""}}}' --role Admin
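
For anyone calling this from Python rather than the CLI, here is a rough boto3 sketch of the same request. The helper names build_evaluation_request and start_partition_evaluation are made up for illustration, and the database, table, ruleset, role and region values are just the placeholders from the command above. The request is built separately from the call so you can inspect it before anything is sent to AWS.

```python
def build_evaluation_request(database, table, ruleset_names, role, partition_predicate):
    """Build the keyword arguments for start_data_quality_ruleset_evaluation_run.

    Hypothetical helper: it only assembles a dict and makes no AWS calls.
    """
    return {
        "DataSource": {
            "GlueTable": {
                "DatabaseName": database,
                "TableName": table,
                # Restricts the evaluation to partitions matching the predicate.
                "AdditionalOptions": {"pushDownPredicate": partition_predicate},
            }
        },
        "Role": role,
        "RulesetNames": list(ruleset_names),
    }


def start_partition_evaluation(request, region="us-east-2"):
    """Start the evaluation run and return its run ID (needs AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3 installed

    glue = boto3.client("glue", region_name=region)
    response = glue.start_data_quality_ruleset_evaluation_run(**request)
    return response["RunId"]


# Same placeholder values as the CLI example above.
request = build_evaluation_request(
    database="db",
    table="tb",
    ruleset_names=["ruleset1"],
    role="Admin",
    partition_predicate='year="2022"',
)
```

Calling start_partition_evaluation(request) would then kick off the run for just that partition. Changing the predicate string per snapshot is all that should be needed to evaluate a different partition.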

A similar request structure can be used to get rule recommendations on the filtered data as well.

AWS
Answered a year ago
