How To Get Bad Records Using AWS Pydeequ - Data Quality Checks


I am performing data quality checks in Databricks using AWS PyDeequ. When I run the code below, it only returns metrics as output (check_level, check_status, constraint, constraint_status, constraint_message). My question: how can I get the failed (bad) records into a separate DataFrame or table, along with the metrics (constraint_status, constraint_message), so the bad data is not processed further, and split the good records into a separate DataFrame for further processing?

Source_DF:

df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

Code:

from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.hasSize(lambda x: x >= 3000000)
        .hasMin("star_rating", lambda x: x == 1.0)
        .hasMax("star_rating", lambda x: x == 5.0)
        .isComplete("review_id")
        .isUnique("review_id")
        .isComplete("marketplace")
        .isContainedIn("marketplace", ["US", "UK", "DE", "JP", "FR"])
        .isNonNegative("year")
    )
    .run()
)

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.display()

Please share any solution or code to achieve this scenario. That would be helpful.

No answers yet.
