How can I determine an appropriate number of workers for Glue Data Quality Ruleset runs on large datasets?

I want to check data quality on multiple SQL tables, some of which have up to 180 million entries. The tables are loaded into Glue using Crawlers, which appears to work fine. Each table has a relatively complex ruleset to check, including some custom SQL. On one of the larger tables, I attempted to run the ruleset consisting of 19 rules with the default 5 workers, but had to quit the run after it ran for ~20 hours. I have since tried to find out how I can scale this work so that these runs are more efficient and ideally less cost-intensive.

It appears the only change I can make is the number of workers. How can I find out how many workers I need? Looking in CloudWatch and at the API via boto3, I could not find any resource usage statistics, either for the stopped run or for any of the successful data quality runs on other tables.

strupp1
Asked 8 months ago, 159 views
1 Answer
Accepted Answer

There is no single right number, since it depends on the data, the rules, and how long you are willing to wait. Sometimes adding workers won't help at all if the run is bottlenecked elsewhere. In general, try to avoid custom SQL rules, since each one has to run independently and goes through the data on a separate read.
If you apply the same rules in a Glue job, you can use the Spark UI to view the execution and possibly find where the bottleneck is.
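
Since there is no formula, one empirical option is to start the same ruleset with a few different worker counts (ideally on a smaller table or with a timeout) and compare how long each run took. The following is only a sketch under assumptions: it relies on the boto3 Glue calls `start_data_quality_ruleset_evaluation_run` and `get_data_quality_ruleset_evaluation_run`, it assumes the run description reports `ExecutionTime` in seconds and `NumberOfWorkers`, and the helper names and the "worker-hours" figure are illustrative, not an official cost metric.

```python
def start_run(glue, database, table, ruleset_name, role_arn, workers, timeout_min=120):
    """Start a Data Quality evaluation run with an explicit worker count.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    """
    response = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={"GlueTable": {"DatabaseName": database, "TableName": table}},
        Role=role_arn,
        NumberOfWorkers=workers,
        Timeout=timeout_min,  # minutes; caps a run that would otherwise go on for hours
        RulesetNames=[ruleset_name],
    )
    return response["RunId"]


def summarize_run(glue, run_id):
    """Return status, duration, and a rough worker-hours figure for one run."""
    run = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    seconds = run.get("ExecutionTime")   # assumed to be seconds
    workers = run.get("NumberOfWorkers")
    worker_hours = None
    if seconds is not None and workers is not None:
        # Rough comparison value only: duration multiplied by worker count.
        worker_hours = round(seconds * workers / 3600, 2)
    return {
        "status": run["Status"],
        "workers": workers,
        "seconds": seconds,
        "worker_hours": worker_hours,
    }
```

If doubling the workers roughly halves the duration, the worker-hours stay about the same and the run is scaling; if the duration barely changes, more workers are probably not addressing the bottleneck.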

AWS
Expert
Answered 8 months ago
  • Thank you, this helps already. Unfortunately, we need a bunch of CustomSql rules because we have many columns that can be NULL or must have a specific format. Would you say as a rule of thumb, we should have one worker for each CustomSql rule and some extra for the other rules?

  • The general idea is that column rules are cheaper than table rules, so try to do that format check with a column rule. And no, you cannot directly relate the number of rules to the number of workers; the volume of data has a bigger impact, so the relationship is not linear.
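
To illustrate the column-rule suggestion, here is a minimal sketch that generates a DQDL ruleset using the built-in `IsComplete` and `ColumnValues ... matches` rule types in place of CustomSql checks. The helper `build_ruleset` and the column names are hypothetical. How `ColumnValues` treats NULL values in a nullable column should be checked against the DQDL reference before relying on it, and regex patterns containing quotes or backslashes may need extra escaping.

```python
def build_ruleset(required_columns, format_patterns):
    """Build a DQDL ruleset string from column-level rules only.

    required_columns: columns that must never be NULL.
    format_patterns:  mapping of column name -> regular expression.
    """
    rules = [f'IsComplete "{col}"' for col in required_columns]
    rules += [
        f'ColumnValues "{col}" matches "{pattern}"'
        for col, pattern in format_patterns.items()
    ]
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


# Hypothetical columns, for illustration only.
print(build_ruleset(["order_id"], {"zip_code": "[0-9]{5}"}))
```

The resulting string can be pasted into the ruleset editor or passed as the ruleset definition when creating a ruleset through the API.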
