How To Get Bad Records Using AWS Pydeequ - Data Quality Checks


I am running data quality checks with AWS PyDeequ in Databricks. When I run the code below, the output contains only the metric results (check_level, check_status, constraint, constraint_status, constraint_message). How can I get the failed (bad) records into a separate DataFrame or table, together with the metrics (constraint_status, constraint_message), so that bad data is not processed further, and put the good records into a separate DataFrame for further processing?

Source_DF:

df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

Code:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

check = Check(spark, CheckLevel.Warning, "Review Check")

# Run all constraints against the source DataFrame (df, read above)
checkResult = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.hasSize(lambda x: x >= 3000000)
        .hasMin("star_rating", lambda x: x == 1.0)
        .hasMax("star_rating", lambda x: x == 5.0)
        .isComplete("review_id")
        .isUnique("review_id")
        .isComplete("marketplace")
        .isContainedIn("marketplace", ["US", "UK", "DE", "JP", "FR"])
        .isNonNegative("year")
    )
    .run()
)

# One row per constraint: status and message, not the offending records
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.display()

Please share any solution or codes to achieve this scenario. That would be helpful.
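
One possible direction, sketched under assumptions rather than taken from PyDeequ itself: since checkResultsAsDataFrame reports one row per constraint, the split into good and bad records could be done with plain PySpark by re-expressing the row-level constraints as SQL conditions. The names ROW_RULES, failed_rules_expr and split_good_bad below are hypothetical helpers, and the conditions are my own approximations of isComplete, isContainedIn and isNonNegative. Dataset-level checks (hasSize, hasMin, hasMax) have no per-row counterpart, and isUnique would need an extra count per review_id, so they are left out.

```python
# Hypothetical row-level rules mirroring some checks from the question.
# Each value is a Spark SQL condition that a GOOD row satisfies.
ROW_RULES = {
    "review_id_complete": "review_id IS NOT NULL",
    "marketplace_complete": "marketplace IS NOT NULL",
    "marketplace_allowed": (
        "marketplace IS NULL OR marketplace IN ('US', 'UK', 'DE', 'JP', 'FR')"
    ),
    "year_non_negative": "year IS NULL OR year >= 0",
}


def failed_rules_expr(rules):
    """Build a Spark SQL expression that yields the names of the rules a row fails."""
    cases = ", ".join(
        f"CASE WHEN NOT coalesce(({cond}), false) THEN '{name}' END"
        for name, cond in rules.items()
    )
    # Rules that pass produce NULL entries, which the filter() call drops.
    return f"filter(array({cases}), rule -> rule IS NOT NULL)"


def split_good_bad(df, rules=ROW_RULES):
    """Return (good_df, bad_df); bad_df keeps a failed_rules array column."""
    # Imported lazily so the expression builder can be used without Spark.
    from pyspark.sql import functions as F

    tagged = df.withColumn("failed_rules", F.expr(failed_rules_expr(rules)))
    bad_df = tagged.where(F.size("failed_rules") > 0)
    good_df = tagged.where(F.size("failed_rules") == 0).drop("failed_rules")
    return good_df, bad_df
```

With the DataFrame from the question this would be called as good_df, bad_df = split_good_bad(df). The failed_rules column in bad_df plays the role of constraint_message at row level; it could be written to a quarantine table while good_df continues through the pipeline.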
