Data Quality Framework in AWS
I am trying to implement a data quality framework for an application which ingests data from various systems (batch, near-real-time, and real-time). A few items I want to highlight:
- The data pipelines vary widely and ingest very high volumes of data. They are developed using Spark, Python, EMR clusters, Kafka, and Kinesis streams.
- Any new system we onboard into the framework should be able to include the data quality checks with minimal coding, so some sort of metadata framework might help; for example, storing the business rules in DynamoDB so they run automatically against different feeders and newly created data pipelines (a rough sketch of what I mean follows this list).
- Our tech stack includes AWS, Python, Spark, and Java, so kindly advise related services (AWS Glue DataBrew, the PyDeequ and Great Expectations libraries, and the various Lambda event-driven services are some I want to focus on).
- I am also looking for some sort of audit, balance, and control mechanism: auditing the source data, balancing record counts between two points, and having an automated mechanism to remediate (control) any discrepancies (a second sketch of this also follows the list).
- I am looking for testing frameworks for the different data pipelines, and also for data profiling tools/libraries; AWS Glue DataBrew and Pandas are some I am exploring.
I know there won't be one specific solution, so I appreciate any and all different ideas. A flow diagram of audit, balance, and control with an automated data validation and testing mechanism for data pipelines would be very helpful.
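To make the metadata-driven idea concrete, here is a rough sketch of what I have in mind. The table name `dq_rules` and the `rule_type`/`column`/`threshold` attributes are placeholders I made up; it assumes PyDeequ running against a Spark session on EMR or Glue:

```python
# Sketch: rules live in a DynamoDB table (hypothetical name: dq_rules, keyed by
# pipeline_id) and are translated into PyDeequ constraints at runtime, so a new
# feeder only needs new rule items, not new code.
import boto3
from boto3.dynamodb.conditions import Key
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult


def load_rules(pipeline_id: str) -> list:
    """Fetch the rule items registered for one pipeline/feeder."""
    table = boto3.resource("dynamodb").Table("dq_rules")  # hypothetical table
    resp = table.query(KeyConditionExpression=Key("pipeline_id").eq(pipeline_id))
    return resp["Items"]


def run_checks(spark, df, pipeline_id: str):
    """Translate rule metadata into PyDeequ constraints and run them."""
    check = Check(spark, CheckLevel.Error, f"{pipeline_id} checks")
    for rule in load_rules(pipeline_id):
        if rule["rule_type"] == "not_null":
            check = check.isComplete(rule["column"])
        elif rule["rule_type"] == "unique":
            check = check.isUnique(rule["column"])
        elif rule["rule_type"] == "min_size":
            threshold = int(rule["threshold"])
            # default arg pins the threshold per rule (avoids late binding)
            check = check.hasSize(lambda n, t=threshold: n >= t)
    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    return VerificationResult.checkResultsAsDataFrame(spark, result)
```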
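For the balance piece, I imagine the building block could be as simple as comparing record counts at two checkpoints and publishing the delta as a CloudWatch metric, so that an alarm plus SNS/Lambda can drive the automated remediation (control) step. Another rough sketch; the namespace and metric name are placeholders:

```python
# Sketch: compare record counts at two points of a pipeline and publish the
# delta to CloudWatch so an alarm can trigger downstream remediation.
import boto3


def balance_check(source_count: int, target_count: int, pipeline_id: str) -> bool:
    """Return True if counts reconcile; always publish the delta for auditing."""
    delta = source_count - target_count
    boto3.client("cloudwatch").put_metric_data(
        Namespace="DataQuality/Balance",  # placeholder namespace
        MetricData=[{
            "MetricName": "RecordCountDelta",
            "Dimensions": [{"Name": "PipelineId", "Value": pipeline_id}],
            "Value": delta,
        }],
    )
    return delta == 0
```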
asked 13 days ago · 11 views
1 Answer
We have a number of examples:
- How to Architect Data Quality on the AWS Cloud
- Building a serverless data quality and analysis framework with Deequ and AWS Glue
- Build event-driven data quality pipelines with AWS Glue DataBrew
- Test data quality at scale with Deequ
- Monitor data quality in your data lake using PyDeequ and AWS Glue
There are likely more. Hope these help.
answered 12 days ago
Thank you! I have checked some of these links and they are certainly helpful for designing what I need. Do you have any recommendation for this scenario: any new system that we onboard into the framework should be able to include the data quality checks with minimal coding, so some sort of metadata framework might help, for example storing the business rules in DynamoDB so they run automatically against different feeders and newly created data pipelines?