Data Quality Rules for Hudi Tables Managed by Lake Formation in AWS DataZone

Problem statement: I'm creating Data Quality (DQ) rules for Hudi tables managed by AWS Lake Formation, and I need the results of these rules to be displayed in AWS DataZone. However, according to the AWS documentation, AWS Glue DQ rules cannot be applied directly to Hudi tables managed by Lake Formation because of a documented limitation (https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html#data-quality-release-notes-ga).

Use Case:

  • The Hudi tables are managed through Lake Formation and accessed via Athena for querying.
  • I want to implement DQ rules (e.g., schema validation, null checks, data range checks) for these Hudi tables and make the DQ results visible within AWS DataZone for stakeholders.
  • I don't want to use Glue jobs. I will create the DQ ruleset from the console and run it using EMR jobs.

Question:

  • What is the recommended approach to create and integrate DQ rules for Hudi tables managed by Lake Formation?
  • How can I make the results of these DQ rules visible in AWS DataZone, given the Glue limitation?

Expected Outcome:

  • Ability to define, implement, and manage DQ rules for Hudi tables.
  • Seamless integration of the DQ rule results into AWS DataZone to support governance and visibility.

If there are alternative solutions, best practices, or workarounds, I’d appreciate any guidance.

asked 10 months ago, 139 views
1 Answer

Because AWS Glue Data Quality rules cannot be applied directly to Hudi tables managed by Lake Formation, you can run the checks with an external framework and push the results to Amazon DataZone yourself. Consider the following approach:

  1. External Data Quality Solution: Instead of relying on AWS Glue Data Quality, you can implement an external data quality solution using Amazon EMR Serverless with a framework like PyDeequ. This allows you to create custom data quality rules tailored to your Hudi tables.

  2. Create Custom Data Quality Rules: Develop your data quality rules using PyDeequ or a similar framework. These rules can include schema validation, null checks, data range checks, and any other specific requirements for your Hudi tables.

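  3. Execute Data Quality Checks: Run your custom data quality checks using EMR jobs as you've mentioned. Note that Spark jobs on EMR do not go through Athena, which is a separate query engine; they would typically read the Hudi tables through the AWS Glue Data Catalog, subject to the Lake Formation permissions granted to the job's role.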

  4. Generate Data Quality Scores: After running the data quality checks, generate data quality scores based on the results of your rules.

  5. Integrate with Amazon DataZone: Amazon DataZone provides an API for importing data quality scores from external systems: the PostTimeSeriesDataPoints operation, used with the amazon.datazone.DataQualityResultFormType form type. You can use it to push your custom data quality results into DataZone, so that the data quality scores for your Hudi tables are displayed in the DataZone data portal.

  6. Implement the Integration: Create a script (e.g., in Python) that takes the results from your EMR job, formats them appropriately, and uses the Amazon DataZone APIs to post the data quality scores. This script can be executed as part of your EMR workflow or as a separate step after the data quality checks are complete.

  7. Configure Permissions: Ensure that Amazon DataZone has the necessary permissions to access the AWS Glue database managed through AWS Lake Formation. This may involve configuring Lake Formation permissions for Amazon DataZone.

  8. Visualize in DataZone: Once the integration is set up, you'll be able to view the data quality scores for your Hudi tables on the asset's page in the Amazon DataZone data portal, providing visibility to your stakeholders.
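Steps 2 and 3 above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the table name `sales_db.orders_hudi`, its columns, and the expected schema are hypothetical placeholders, and it assumes PyDeequ and a matching Deequ JAR are available on the EMR cluster and that Spark resolves tables through the Glue Data Catalog. The schema check is done in plain Python on the DataFrame's `dtypes`; the null, uniqueness, and range checks use PyDeequ.

```python
# Hypothetical placeholders: table name, column names, and expected types.
EXPECTED_SCHEMA = {"order_id": "string", "amount": "double", "order_ts": "timestamp"}


def schema_mismatches(actual_dtypes, expected):
    """Compare Spark-style (column, type) pairs against an expected schema."""
    actual = dict(actual_dtypes)
    problems = []
    for column, expected_type in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            problems.append(f"{column}: expected {expected_type}, found {actual[column]}")
    return problems


def run_checks(spark, table_name="sales_db.orders_hudi"):
    """Run schema, null, uniqueness, and range checks on a catalog table."""
    # Imported lazily so the pure-Python helper above works without Spark installed.
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationResult, VerificationSuite

    # Assumption: the cluster is configured to use the Glue Data Catalog.
    df = spark.table(table_name)

    check = (
        Check(spark, CheckLevel.Error, "orders_hudi basic checks")
        .isComplete("order_id")                         # null check
        .isUnique("order_id")                           # duplicate check
        .isNonNegative("amount")                        # lower bound of the range
        .hasMax("amount", lambda x: x <= 1_000_000.0)   # upper bound of the range
    )
    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    rows = VerificationResult.checkResultsAsDataFrame(spark, result).collect()

    # Returns (schema problems, one dict per evaluated constraint).
    return schema_mismatches(df.dtypes, EXPECTED_SCHEMA), [row.asDict() for row in rows]
```

Each constraint dict returned by `run_checks` carries a `constraint` and a `constraint_status` field, which is enough to derive a pass/fail result per rule.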

This approach allows you to create custom data quality rules for your Hudi tables, execute them using EMR jobs, and publish the results to Amazon DataZone, working around the AWS Glue Data Quality limitation for Hudi tables managed by Lake Formation.
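The scoring and integration steps (4 to 6) might look something like the sketch below. It assumes each check result is a dict with `constraint` and `constraint_status` keys (the shape PyDeequ reports); the `rule_type` and `columns` keys, the function names, and the domain and asset identifiers are hypothetical. The form content fields (`evaluations`, `evaluationsCount`, `passingPercentage`) follow the example in the AWS Big Data Blog post listed under Sources, so verify them against the current DataZone API reference before relying on them.

```python
import json
from datetime import datetime, timezone


def build_dq_form(constraint_results, ruleset_name):
    """Turn per-constraint results into a DataZone data quality form."""
    evaluations = []
    for r in constraint_results:
        passed = r["constraint_status"] == "Success"
        evaluations.append({
            "types": [r.get("rule_type", "Custom")],      # hypothetical optional key
            "description": r["constraint"],
            "details": {},
            "applicableFields": r.get("columns", []),     # hypothetical optional key
            "status": "PASS" if passed else "FAIL",
        })
    passing = sum(1 for e in evaluations if e["status"] == "PASS")
    score = round(100.0 * passing / len(evaluations), 2) if evaluations else 0.0
    content = {
        "evaluations": evaluations,
        "evaluationsCount": len(evaluations),
        "passingPercentage": score,
    }
    return {
        "formName": ruleset_name,
        "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
        "timestamp": datetime.now(timezone.utc),
        "content": json.dumps(content),   # the API expects the form content as a JSON string
    }


def post_dq_form(domain_id, asset_id, form):
    """Post the form to an existing DataZone asset (needs AWS credentials)."""
    import boto3  # imported lazily so build_dq_form can be used without boto3

    client = boto3.client("datazone")
    return client.post_time_series_data_points(
        domainIdentifier=domain_id,
        entityIdentifier=asset_id,
        entityType="ASSET",
        forms=[form],
    )
```

A call such as `post_dq_form("dzd_example", "asset-id-example", build_dq_form(results, "orders_hudi_ruleset"))` could run as the last step of the EMR job; both identifiers here are placeholders for your own domain and asset IDs.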

Remember to regularly review and update your data quality rules as your data and requirements evolve. Also, consider implementing monitoring and alerting for your data quality process to ensure timely detection of any issues.

Sources:

  • Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions | AWS Big Data Blog
  • Implement data quality checks on Amazon Redshift data assets and integrate with Amazon DataZone | AWS Big Data Blog

answered 10 months ago
EXPERT
reviewed 10 months ago
