When running a Glue Data Quality evaluation, how can I filter the data it is run on?

I'd like to run evaluations of my data quality rulesets on single partitions of my table rather than the whole table. This is because for most of my tables each partition effectively represents a snapshot of the data and running the checks only makes sense in the context of a single partition. Is there a way to filter or subset the data that a ruleset is evaluated on?

Preferably I'd like to do this when triggering the evaluation, but defining a restriction in the DQDL rules might also work.

asked a year ago · 228 views
1 answer
The start_data_quality_ruleset_evaluation_run API lets you pass a pushDownPredicate in the AdditionalOptions of the GlueTable data source. You can pass a catalogPartitionPredicate the same way.

aws glue start-data-quality-ruleset-evaluation-run --ruleset-names ruleset1 --region us-east-2 --data-source '{"GlueTable":{"DatabaseName": "db", "TableName":"tb", "AdditionalOptions": {"pushDownPredicate": "year=\"2022\""}}}' --role Admin

A similar API structure can be used to get rule recommendations on the filtered data as well.
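
For anyone calling this from Python rather than the CLI, here is a minimal boto3 sketch of the same request. The helper names `build_evaluation_request` and `start_partition_evaluation` are made up for illustration, and the database, table, ruleset, role, and region values are just the placeholders from the CLI command above. It assumes your credentials and the role are allowed to run Glue Data Quality evaluations.

```python
def build_evaluation_request(database, table, ruleset_names, role, partition_predicate):
    """Build the keyword arguments for start_data_quality_ruleset_evaluation_run.

    partition_predicate is passed as pushDownPredicate, so only the matching
    partitions of the table are evaluated.
    """
    return {
        "DataSource": {
            "GlueTable": {
                "DatabaseName": database,
                "TableName": table,
                "AdditionalOptions": {"pushDownPredicate": partition_predicate},
            }
        },
        "Role": role,
        "RulesetNames": list(ruleset_names),
    }


def start_partition_evaluation(request, region_name="us-east-2"):
    """Start the evaluation run and return its RunId (hypothetical helper)."""
    # Imported here so the request can be built and inspected without boto3 or AWS access.
    import boto3

    glue = boto3.client("glue", region_name=region_name)
    response = glue.start_data_quality_ruleset_evaluation_run(**request)
    return response["RunId"]


# Placeholder values taken from the CLI example; replace them with your own.
request = build_evaluation_request(
    database="db",
    table="tb",
    ruleset_names=["ruleset1"],
    role="Admin",
    partition_predicate='year="2022"',
)
```

Calling `start_partition_evaluation(request)` submits the run. To evaluate a different partition, only the predicate string needs to change.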

AWS
answered a year ago
