How To Get Bad Records Using AWS Pydeequ - Data Quality Checks


I am performing data quality checks in Databricks using AWS PyDeequ. When I run the code below, it returns only the metrics results (check_level, check_status, constraint, constraint_status, constraint_message). My question: how can I extract the failed (bad) records into a separate DataFrame or table, together with the metrics (constraint_status, constraint_message), so that the bad data is not processed further, while the good records are split into their own DataFrame for downstream processing?


from pydeequ.checks import *
from pydeequ.verification import *

df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x >= 3000000)
        .hasMin("star_rating", lambda x: x == 1.0)
        .hasMax("star_rating", lambda x: x == 5.0)
        .isContainedIn("marketplace", ["US", "UK", "DE", "JP", "FR"])
    ) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.display()

Please share any solution or code to achieve this scenario. That would be helpful.
