How To Get Bad Records Using AWS Pydeequ - Data Quality Checks

0

Using AWS Pydeequ in databricks I am performing Data Quality checks. When I run this below mentioned code it provide only metrics results as my output (like Check_level, check_status, constraint, constraint_status, constraint_message). My Question is how can I get the failed records(Bad records) put it in separate dataframe or a table along with metrics(constraint_status, constraint_message) bad data should not process further and split good record put it in separate dataframe to process further ?

Source_DF:

df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

Code:

from pydeequ.checks import * from pydeequ.verification import *

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = VerificationSuite(spark)
.onData(source)
.addCheck( check.hasSize(lambda x: x >= 3000000)
.hasMin("star_rating", lambda x: x == 1.0)
.hasMax("star_rating", lambda x: x == 5.0)
.isComplete("review_id")
.isUnique("review_id")
.isComplete("marketplace")
.isContainedIn("marketplace", ["US", "UK", "DE", "JP", "FR"])
.isNonNegative("year"))
.run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult) checkResult_df.display()

Please share any solution or codes to achieve this scenario. That would be helpful.

Keine Antworten

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen