How do I troubleshoot issues with AWS Glue Data Quality rules?

I want to troubleshoot issues with AWS Glue Data Quality rules and rulesets.

Resolution

You might experience the following issues with AWS Glue Data Quality rules and rulesets:

  • Your key map isn't suitable for the data frame.
  • The value that you set doesn't meet the constraint requirement.
  • There are incorrect regular expressions in column validation rules.
  • Your AWS Glue Data Quality job performs slowly.
  • The AWS Glue object is missing the start_data_quality_rule_recommendation_run attribute.

Complete the following actions that correspond with your issue.

Your key map isn't suitable for the data frame   

You receive the following error message when you run a DatasetMatch rule:

"Provided Key Map Not Suitable for Given DataFrames" 

To resolve this issue, take the following actions:

  • Verify that the join keys are unique in both the primary and reference datasets. The join keys must be primary keys that don't contain duplicates.
  • Verify that the key columns don't contain NULL values.
  • (Optional) Use a different dataset that includes clean key columns.

In the following example DatasetMatch rule, example_a and example_b must contain only unique values that aren't NULL:

Rules = [  
    DatasetMatch "reference" "example_a,example_b" = 1  
]
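Before you run the rule, you can check a sample of the key columns for duplicates and NULL values. The following is a minimal sketch that assumes the sample rows are held in memory as Python dictionaries. The function name check_join_keys and the sample data are hypothetical. In an AWS Glue job, you would run the equivalent check on your DataFrames.

```python
from collections import Counter


def check_join_keys(rows, key_columns):
    """Count rows with NULL key parts and list key tuples that appear more than once."""
    null_key_rows = 0
    counts = Counter()
    for row in rows:
        key = tuple(row.get(column) for column in key_columns)
        if any(value is None for value in key):
            # A NULL in any key column makes the row unusable as a join key.
            null_key_rows += 1
        else:
            counts[key] += 1
    duplicate_keys = sorted(key for key, count in counts.items() if count > 1)
    return {"null_key_rows": null_key_rows, "duplicate_keys": duplicate_keys}


# Hypothetical sample rows for illustration only.
sample_rows = [
    {"example_a": 1, "example_b": "x"},
    {"example_a": 1, "example_b": "x"},   # duplicate key
    {"example_a": 2, "example_b": None},  # NULL in a key column
    {"example_a": 3, "example_b": "z"},
]

report = check_join_keys(sample_rows, ["example_a", "example_b"])
print(report)
```

If either value in the report isn't empty or zero, then clean the key columns in that dataset before you run the DatasetMatch rule.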

The value that you set doesn't meet the constraint requirement

You receive the following error message when you run a data quality check:

"Value does not meet constraint requirement"

To resolve this issue, take the following actions:

  • Review your rule configuration, and verify that the constraints on your column values match the characteristics of your data. For example, if you check your data for uniqueness, then make sure that there are no duplicate values in the column.
  • Make sure that the data that you check follows the format that the rule specifies.
  • Clean the dataset so that it meets the constraints that the rule specifies.

In the following example DatasetMatch rule, the example_a column must contain only unique values that aren't NULL:

Rules = [  
    DatasetMatch "reference" "example_a,example_a" = 1  
]

There are incorrect regular expressions in column validation rules

Use the following best practices when you use ColumnValues rules to validate column formats:

  • Use regular expressions that match your data format. The following example rule matches dates that are formatted as either yyyy-mm-dd or yyyy/mm/dd. The pattern also accepts a space or a period as the separator:

    Rules = [  
        ColumnValues "test_date" matches "^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])"  
    ]

    However, the following example rule matches only dates that are formatted as yyyy-mm-dd:

    Rules = [  
        ColumnValues "test_date" matches "^(19|20)\d\d[-](0[1-9]|1[012])[-](0[1-9]|[12][0-9]|3[01])"  
    ]
  • Before you apply regular expressions, use online tools or Python libraries to test your regular expressions against sample data.

  • Make sure that your regular expression matches your expected column format. If your data uses more than one format, then adjust the pattern so that it includes each format.
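To test a pattern against sample data before you add it to a ruleset, you can use Python's re module. The following is a minimal sketch. The two patterns are copied from the preceding example rules, and the helper name split_by_pattern and the sample values are hypothetical. Python's regular expression engine might behave differently from the engine that AWS Glue uses, so treat the result as a first check only.

```python
import re

# Patterns copied from the two example rules in this article.
ANY_SEPARATOR = r"^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])"
HYPHEN_ONLY = r"^(19|20)\d\d[-](0[1-9]|1[012])[-](0[1-9]|[12][0-9]|3[01])"

# Hypothetical sample values for illustration only.
SAMPLES = ["2024-01-15", "2024/01/15", "15-01-2024", "2024-13-01"]


def split_by_pattern(pattern, values):
    """Return (matching, non_matching) lists for the given sample values."""
    compiled = re.compile(pattern)
    matching = [value for value in values if compiled.search(value)]
    non_matching = [value for value in values if not compiled.search(value)]
    return matching, non_matching


for name, pattern in [("ANY_SEPARATOR", ANY_SEPARATOR), ("HYPHEN_ONLY", HYPHEN_ONLY)]:
    matching, non_matching = split_by_pattern(pattern, SAMPLES)
    print(name, "matches:", matching, "rejects:", non_matching)
```

Neither pattern ends with an end-of-string anchor, so a value with trailing characters after a valid date can still match. Add $ to the end of the pattern if your column must contain only the date.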

Your AWS Glue Data Quality job performs slowly

If you experience slow performance when you run an AWS Glue Data Quality job, then see How do I create AWS Glue Data Quality rules and optimize their performance? in the Related information section.

The AWS Glue object is missing the start_data_quality_rule_recommendation_run attribute

You receive the following error message:

"AttributeError: 'Glue' object has no attribute 'start_data_quality_rule_recommendation_run'"

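This error typically occurs when the Boto3 version in your job's environment doesn't include the AWS Glue Data Quality API operations. If your AWS Glue object doesn't have the start_data_quality_rule_recommendation_run attribute, then add the following key-value pair to your job's parameters: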

Key: --additional-python-modules  
Value: boto3==1.28.26
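After you add the parameter, you can confirm at the start of your script that the client exposes the method before you call it. The following is a minimal sketch. The helper name supports_recommendation_runs is hypothetical, and the commented usage assumes that you create the client with boto3.client("glue").

```python
REQUIRED_METHOD = "start_data_quality_rule_recommendation_run"


def supports_recommendation_runs(glue_client):
    """Return True if the client object exposes the recommendation run method."""
    return callable(getattr(glue_client, REQUIRED_METHOD, None))


# Example usage in a job script (assumes that boto3 is installed in the job):
#   import boto3
#   glue = boto3.client("glue")
#   if not supports_recommendation_runs(glue):
#       raise RuntimeError(f"boto3 {boto3.__version__} doesn't include {REQUIRED_METHOD}")
```

A check like this fails early with a clear message instead of raising the AttributeError partway through the job.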

Related information

How do I create AWS Glue Data Quality rules and optimize their performance?

AWS OFFICIAL, updated a year ago