
AWS Glue Data Quality rules not passing based on not recognising schema contents like column names


What is causing my data quality rules not to run properly against the table I am pointing to?

I am working directly in a table that I created in Athena, and I am trying to run the new Data Quality feature against it. Even though I constructed the rules in the UI and entered the column names exactly as they appear, when I run the ruleset I get errors like the following:

Rule_1 ColumnCount = 18 — Rule failed: Dataset has 0.0 columns and failed to satisfy constraint
Rule_3 IsUnique "<columnName>" — Rule failed: Input data does not include column <columnName>!

None of my rules pass. It seems like I am not correctly pointing my ruleset at the schema, even though I am running Data Quality from inside the table. Does anyone know why this would be happening?

asked 10 months ago · 395 views
2 Answers

Greeting

Hi Rachel,

Thanks for reaching out about your issue with AWS Glue Data Quality rules! I can imagine how frustrating it must be to see these errors despite your efforts to configure the rules correctly. Don’t worry—I’m here to help figure this out and get things working for you. 😊


Clarifying the Issue

From your description, it seems you're working with a table created in Athena and attempting to apply AWS Glue Data Quality rules to validate the data. However, you're encountering errors like "dataset has 0.0 columns" and "Input data does not include column <columnName>," which indicate that Glue isn't correctly recognizing your schema.

It sounds like the Glue Data Catalog may not be fully synchronized with your Athena table, or there might be an issue with how the ruleset is linked to the schema. These challenges are common when dealing with Glue and Athena integration, and I’m confident we can troubleshoot this effectively. Let's take it step by step!


Key Terms

  • AWS Glue Data Quality: A feature that validates datasets using rules to ensure data consistency and reliability.
  • Schema: The structure of your dataset, including column names, data types, and other metadata.
  • Athena Table: A table created in Amazon Athena, typically based on files stored in an S3 bucket.
  • Ruleset: A collection of validation rules applied to a dataset to enforce data quality standards.

The Solution (Our Recipe)

Steps at a Glance:

  1. Confirm that the Glue Data Catalog is synchronized with your Athena table.
  2. Verify schema recognition in Glue by inspecting the Data Catalog entry for your table.
  3. Reconfigure the ruleset to match the schema exactly.
  4. Test the ruleset execution using a smaller sample dataset.

Step-by-Step Guide:

  1. Confirm the Glue Data Catalog is synchronized with your Athena table:
    • Open the AWS Glue Console.
    • Navigate to the Data Catalog and locate your table.
    • Check if the schema (columns and types) matches your Athena table. If there’s a mismatch, refresh the table metadata in Glue.
    • Use the following AWS CLI command to inspect the table's schema:
      aws glue get-table --database-name <your-database-name> --name <your-table-name>
    • If discrepancies exist, use a Glue crawler to update the metadata:
      aws glue start-crawler --name <crawler-name>

  2. Verify schema recognition in Glue:
    • In the Glue Console, under Tables, ensure all column names and data types are visible and correct.
    • Look out for special characters, extra spaces, or mismatched casing in column names that might cause Glue to reject them.

  3. Reconfigure the ruleset to match the schema exactly:
    • Open your ruleset in the Glue Studio interface.
    • Update column names in the rules to match the case-sensitive names from the Glue Data Catalog. For example, if your rule is:
      IsUnique("<ColumnName>")
      Verify that <ColumnName> matches the exact column name in the Data Catalog.

  4. Test the ruleset execution using a smaller sample dataset:
    • Query a smaller dataset in Athena to ensure the schema and data are well-formed:
      SELECT * FROM your_table LIMIT 100;
    • Save the output to an S3 bucket and run your ruleset on this smaller dataset to isolate any remaining issues.
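As a quick sanity check for step 3, the comparison between ruleset column names and Data Catalog column names can be sketched in plain Python. This is not an AWS API call; the column lists below are illustrative placeholders, and in practice you would pull the catalog names from the `aws glue get-table` output:

```python
def find_unmatched_columns(rule_columns, catalog_columns):
    """Return rule columns with no exact (case-sensitive) match in the catalog,
    paired with a case-insensitive suggestion when one exists."""
    catalog_set = set(catalog_columns)
    lower_map = {c.lower(): c for c in catalog_columns}
    problems = []
    for col in rule_columns:
        if col not in catalog_set:
            # A case-insensitive hit usually means a casing mismatch in the rule
            problems.append((col, lower_map.get(col.lower())))
    return problems

# Hypothetical example: the ruleset says "CustomerId" but the catalog stores "customerid"
print(find_unmatched_columns(["CustomerId", "order_date"],
                             ["customerid", "order_date", "amount"]))
# → [('CustomerId', 'customerid')]
```

Any rule column that comes back with a suggestion is almost certainly a casing problem, since Glue matches column names case-sensitively against the catalog.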

Closing Thoughts

Data quality validation can sometimes be tricky, especially when schemas and rules aren't perfectly aligned. Following the steps above should resolve your issue and allow Glue to recognize your schema properly.


Let me know how these steps work for you or if you need further assistance. I’m happy to help troubleshoot further if needed! 😊


Farewell

Good luck resolving this issue, Rachel! I’m confident the steps above will help you get the results you’re looking for. Wishing you success with your data quality testing! 🚀✨


Cheers,

Aaron 😊

answered 10 months ago
  • Thanks Aaron. I followed the steps provided and made some progress when I tried step 4 (testing the ruleset against a smaller sample dataset). When I did this, the rules were applied, which makes me think my original table's dataset is too large. Is there a limit on data quality testing?


Updated Guidance

Hi Rachel,

Thanks for your follow-up! It’s fantastic to hear that testing with a smaller dataset worked—it strongly suggests that the issue lies with the size or complexity of your original dataset. Let’s refine the solution to help you confidently address this challenge. 😊

Data Quality Limits and Large Datasets

AWS Glue Data Quality does not enforce strict dataset size limits, but processing large datasets can be resource-intensive. Performance bottlenecks often stem from memory constraints, insufficient partitioning, or problematic records (e.g., nulls, mixed data types). Your smaller test dataset succeeded because it reduced the load on Glue’s processing environment.

Optimizing for Large Datasets

To run Glue Data Quality effectively on larger datasets, try these strategies:

1. Partition Your Dataset

Partitioning is key for handling large datasets. Ensure your Athena table is partitioned by logical fields like date or region. Glue processes one partition at a time, reducing resource strain and improving performance. For example:

ALTER TABLE your_table ADD PARTITION (year='2025', month='01');

2. Scale Glue Resources

Adjust the resources allocated to your Glue job:

  • Increase "Worker Type" (e.g., G.1X or G.2X) and the "Number of Workers" in the Glue job configuration.
  • Use the AWS CLI to test scaling:
    aws glue start-job-run --job-name <your-job-name> --arguments '{"--enable-metrics": "true"}'
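To reason about how much capacity a scaled-up run will consume, note that each G.1X worker provides 1 DPU and each G.2X worker provides 2 DPUs. The arithmetic can be sketched as follows (the worker counts are illustrative, not a recommendation):

```python
# DPUs provided per worker for the standard Glue worker types
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2}

def total_dpus(worker_type, num_workers):
    """Total DPU capacity for a Glue job run with the given configuration."""
    return DPU_PER_WORKER[worker_type] * num_workers

print(total_dpus("G.2X", 10))  # 10 G.2X workers → 20 DPUs
```

Since Glue bills by DPU-hour, doubling the worker type or count roughly doubles the hourly cost, which is why testing incremental changes first (as noted below under "Anticipating Challenges") matters.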

3. Incremental Rule Application

Break down your ruleset and dataset:

  • Apply rules incrementally (e.g., validate 5 columns instead of all at once).
  • Filter subsets of rows using an Athena query, like:
    SELECT * FROM your_table WHERE year = 2025 LIMIT 1000;
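The incremental approach can be automated by splitting the ruleset into small batches and running each batch as its own evaluation. A minimal sketch (the rule strings are illustrative DQDL-style text, not a full parser):

```python
def chunk_rules(rules, batch_size=5):
    """Split a list of rule strings into batches of at most batch_size,
    so each Data Quality run evaluates only a few columns at a time."""
    return [rules[i:i + batch_size] for i in range(0, len(rules), batch_size)]

# 12 hypothetical completeness rules, batched five at a time
rules = [f'IsComplete "col_{n}"' for n in range(12)]
batches = chunk_rules(rules, batch_size=5)
print([len(b) for b in batches])  # → [5, 5, 2]
```

If one batch fails while the others pass, you have narrowed the problem to a handful of columns instead of the whole schema.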

4. Check Dataset Integrity

Large datasets can include records that disrupt processing. Use Athena to clean up:

  • Identify and address nulls:
    SELECT COUNT(*) FROM your_table WHERE column_name IS NULL;
  • Remove invalid data types or duplicates.
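The same integrity checks can be prototyped locally before running them at scale in Athena. Here is a sketch that counts nulls and distinct value types per column over a list of row dictionaries (the sample rows are made up for illustration):

```python
def column_issues(rows):
    """Count nulls and distinct Python value types per column across rows.
    Columns with more than one type, or many nulls, are likely to disrupt
    data quality evaluation."""
    stats = {}
    for row in rows:
        for col, val in row.items():
            s = stats.setdefault(col, {"nulls": 0, "types": set()})
            if val is None:
                s["nulls"] += 1
            else:
                s["types"].add(type(val).__name__)
    return stats

rows = [{"amount": 10,   "region": "us"},
        {"amount": None, "region": "eu"},
        {"amount": "12", "region": "us"}]  # "12" is a string: mixed types
stats = column_issues(rows)
print(stats["amount"]["nulls"], sorted(stats["amount"]["types"]))  # → 1 ['int', 'str']
```

A column like `amount` above, with both `int` and `str` values, is exactly the kind of record-level inconsistency worth cleaning up before re-running the full ruleset.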

5. Monitor Job Logs

Enable CloudWatch Logs for detailed error insights. If the job fails due to memory limits or timeouts, logs will point to where adjustments are needed.

Anticipating Challenges

Scaling up Glue jobs can increase costs and processing times, so test incremental changes first. Partitioning works best for datasets naturally split into logical groups. If resource limits remain an issue, consider breaking the dataset into separate tables for testing.

Closing Thoughts

With these optimizations, Glue Data Quality should handle your dataset more effectively. Partitioning, scaling resources, and incremental validation are powerful tools for overcoming large-scale processing challenges.


Let me know if these steps help or if further troubleshooting is needed—I’m here to support you! 🚀


Cheers,
Aaron ✨😊

answered 10 months ago
