How To Get Bad Records Using AWS Pydeequ - Data Quality Checks


I am performing data quality checks in Databricks using AWS PyDeequ. When I run the code below, I get only the metrics as output (check_level, check_status, constraint, constraint_status, constraint_message). My question: how can I capture the failed records (bad records) in a separate DataFrame or table, together with the metrics (constraint_status, constraint_message), so that the bad data is not processed further, and split the good records into a separate DataFrame for further processing?

Source_DF:

df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

Code:

from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = (
    VerificationSuite(spark)
    .onData(df)  # the Source_DF read above
    .addCheck(
        check.hasSize(lambda x: x >= 3000000)
        .hasMin("star_rating", lambda x: x == 1.0)
        .hasMax("star_rating", lambda x: x == 5.0)
        .isComplete("review_id")
        .isUnique("review_id")
        .isComplete("marketplace")
        .isContainedIn("marketplace", ["US", "UK", "DE", "JP", "FR"])
        .isNonNegative("year")
    )
    .run()
)

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.display()
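What I'd like to end up with is something like the sketch below, where the same rules are re-applied as plain PySpark filters so the failing rows can be quarantined. This is only my rough idea, not a PyDeequ API: the dataset-level hasSize check has no row-level equivalent so it is left out, isUnique is approximated with a window count, and the _id_count helper column is a name I made up.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count rows per review_id so the uniqueness rule can be checked per row.
w = Window.partitionBy("review_id")
flagged = df.withColumn("_id_count", F.count("*").over(w))

# Row-level version of the column constraints in the Check above.
row_ok = (
    F.col("review_id").isNotNull()
    & (F.col("_id_count") == 1)
    & F.col("marketplace").isNotNull()
    & F.col("marketplace").isin("US", "UK", "DE", "JP", "FR")
    & (F.col("year") >= 0)
    & F.col("star_rating").between(1.0, 5.0)
)

good = flagged.filter(row_ok)
bad = flagged.exceptAll(good)   # rows where row_ok is false *or* null

good_df = good.drop("_id_count")   # safe to process further
bad_df = bad.drop("_id_count")     # quarantine, with checkResult_df alongside

Using exceptAll for the bad rows avoids the three-valued-logic surprise where filter(~row_ok) would silently drop rows in which a constraint evaluates to null.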

Please share any solution or code that achieves this scenario. That would be very helpful.
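For completeness, the hand-off I have in mind once the split works is simply persisting all three frames; the table names below are placeholders I invented:

# Placeholder table names; on Databricks these would typically be Delta tables.
good_df.write.mode("append").saveAsTable("staging.reviews_clean")
bad_df.write.mode("append").saveAsTable("dq.reviews_quarantine")
checkResult_df.write.mode("append").saveAsTable("dq.reviews_check_metrics")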
