How does AWS Glue Data Quality custom SQL work with no unique column?


My dataframe has two columns, name and age. Suppose the name Manish appears in two rows, one with age 16 and another with age 23. For the custom SQL below, will AWS Glue Data Quality fail both rows, pass both, or fail one and pass the other?

"select Name from primary where Age > 18"

The documentation does not mention anywhere that the column chosen in the SELECT clause must be a unique column: https://docs.aws.amazon.com/glue/latest/dg/dqdl.html#dqdl-rule-types-CustomSql

Asked 3 months ago, 198 views
1 answer

When AWS Glue Data Quality evaluates the custom SQL rule, the following happens:

It will run the SQL query on the dataframe. In this case "select Name from primary where Age > 18".

It will return the rows that satisfy the condition, i.e., Age > 18.

If multiple rows with the same Name satisfy the condition, it will return all of them. For example, if there are two rows for the name "Manish", one with age 16 and the other with age 23, it will return the row with age 23, since that row satisfies the condition, but not the row with age 16.

So in this case, it will pass the row with name "Manish" and age 23 but fail the row with the same name "Manish" and age 16.

The documentation does not explicitly describe the behavior when multiple rows share the same value, but based on how custom SQL rules work, this is the expected behavior.
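To make the query's output concrete, here is a minimal sketch that runs the same statement against the two example rows. It uses Python's built-in sqlite3 module as a stand-in for the SQL engine, which is an assumption for illustration only; Glue Data Quality itself is not invoked, and the table name is quoted because PRIMARY is a keyword in SQLite.

```python
import sqlite3

# In-memory SQLite database standing in for the dataset (assumption: not Glue).
conn = sqlite3.connect(":memory:")

# "primary" is quoted because PRIMARY is an SQL keyword in SQLite.
conn.execute('CREATE TABLE "primary" (Name TEXT, Age INTEGER)')
conn.executemany(
    'INSERT INTO "primary" VALUES (?, ?)',
    [("Manish", 16), ("Manish", 23)],
)

# Same statement as in the question, apart from the quoting.
returned = conn.execute('SELECT Name FROM "primary" WHERE Age > 18').fetchall()
print(returned)  # [('Manish',)]
```

Note that the result contains only the Name value, not the Age, so the output on its own does not say which of the two "Manish" rows produced it.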

Expert
Answered 3 months ago
  • No, in my tests this does not work as you described. Glue DQ passes both rows with the name Manish. If I change one name to Rajesh and keep the other as Manish, it works as expected.
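The behavior reported in this comment would be consistent with row-level results being assigned by matching the returned Name values back against the dataset. That is only an assumption, since the linked documentation does not confirm how the matching is done. The sketch below illustrates the idea in plain Python; `label_rows` is a hypothetical helper, not a Glue API.

```python
def label_rows(rows, min_age=18):
    """Hypothetical helper: label each (name, age) row by matching on name only.

    Assumption: a row counts as passed if its name appears among the names
    returned by "select Name from primary where Age > 18".
    """
    passing_names = {name for name, age in rows if age > min_age}
    return [
        (name, age, "Passed" if name in passing_names else "Failed")
        for name, age in rows
    ]

# Two rows sharing the name: both are labeled Passed.
duplicate_names = label_rows([("Manish", 16), ("Manish", 23)])
print(duplicate_names)
# [('Manish', 16, 'Passed'), ('Manish', 23, 'Passed')]

# Distinct names: only the row with Age > 18 is labeled Passed.
distinct_names = label_rows([("Rajesh", 16), ("Manish", 23)])
print(distinct_names)
# [('Rajesh', 16, 'Failed'), ('Manish', 23, 'Passed')]
```

If this assumption holds, selecting a column (or combination of columns) that uniquely identifies each row would avoid the ambiguity.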
