How can I determine an appropriate number of workers for Glue Data Quality Ruleset runs on large datasets?


I want to check data quality on multiple SQL tables, some of which have up to 180 million entries. The tables are loaded into Glue using Crawlers, which appears to work fine. Each table has a relatively complex ruleset to check, including some custom SQL. On one of the larger tables, I attempted to run the ruleset consisting of 19 rules with the default 5 workers, but had to quit the run after it ran for ~20 hours. I have since tried to find out how I can scale this work so that these runs are more efficient and ideally less cost-intensive.

It appears the only setting I can change is the number of workers. How can I find out how many workers I need? Looking in CloudWatch and at the API through boto3, I could not find any usage statistics for the stopped run or for any of the successful data quality runs on other tables.

strupp1
asked 7 months ago, 149 views
1 Answer
Accepted Answer

There is no single right number, since it depends on the data, the rules, and how long you are willing to wait. Sometimes more workers won't help at all if there is a bottleneck elsewhere. In general, try to avoid custom SQL rules, since they have to run independently and each one goes through the data on a separate read.
If you apply the same rules in a Glue job, you can use the Spark UI to inspect the execution and possibly find where the bottleneck is.
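
Since the right worker count has to be found empirically, one option is to run the same ruleset with a few different worker counts and compare the results. The following is a minimal sketch, not a definitive implementation. It assumes a boto3 Glue client created elsewhere (for example `boto3.client("glue")`) and that the run description returned by `get_data_quality_ruleset_evaluation_run` includes `Status`, `NumberOfWorkers`, and `ExecutionTime` in seconds; verify this against the current API reference. The helper names, the 240-minute timeout, and the worker-hours figure are illustrative assumptions. Worker-hours is only a rough cost proxy, since actual billing depends on DPU pricing.

```python
import time

# Statuses after which a run will not change any more (assumed from the API docs).
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def start_run(glue, database, table, ruleset_name, role_arn, workers, timeout_min=240):
    """Start one ruleset evaluation run with an explicit worker count.

    `glue` is a boto3 Glue client. The timeout (in minutes) makes a runaway
    run stop on its own instead of having to be cancelled by hand.
    """
    response = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={"GlueTable": {"DatabaseName": database, "TableName": table}},
        Role=role_arn,
        NumberOfWorkers=workers,
        Timeout=timeout_min,
        RulesetNames=[ruleset_name],
    )
    return response["RunId"]


def wait_for_run(glue, run_id, poll_seconds=60):
    """Poll the run until it reaches a terminal status, then return its description."""
    while True:
        run = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
        if run["Status"] in TERMINAL_STATUSES:
            return run
        time.sleep(poll_seconds)


def summarize(run):
    """Reduce a run description to the numbers needed to compare worker counts."""
    seconds = run.get("ExecutionTime", 0)
    workers = run.get("NumberOfWorkers", 0)
    return {
        "status": run["Status"],
        "workers": workers,
        "minutes": seconds / 60,
        # Rough cost proxy only: workers multiplied by wall-clock hours.
        "worker_hours": workers * seconds / 3600,
    }
```

Running this for, say, 5, 10, and 20 workers on a sample or a smaller table shows whether the runtime actually drops as workers are added. If the worker-hours grow while the minutes barely change, the run is likely bottlenecked elsewhere and more workers only add cost.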

AWS
EXPERT
answered 7 months ago
  • Thank you, this helps already. Unfortunately, we need a bunch of CustomSql rules because we have many columns that can be NULL or must have a specific format. Would you say as a rule of thumb, we should have one worker for each CustomSql rule and some extra for the other rules?

  • The general idea is that column rules are cheaper than table rules, so try to do that format check with a column rule. And no, you cannot directly relate the number of rules to the number of workers: the volume of data has a bigger impact, so it is not a linear correlation.
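
To illustrate the suggestion of using column rules instead of CustomSql, here is a small sketch that generates a DQDL ruleset from a list of required columns and a mapping of columns to regular expressions. It uses the DQDL rule types `IsComplete` and `ColumnValues ... matches`; the function name and the example columns are hypothetical. How `ColumnValues` treats NULL values for columns that are allowed to be empty should be checked against the DQDL reference before relying on this.

```python
def build_ruleset(required=(), patterns=None):
    """Build a DQDL ruleset string made only of column-level rules.

    required: column names that must not contain NULLs (IsComplete).
    patterns: mapping of column name to a regular expression the values
              must match (ColumnValues ... matches).
    Note: this sketch does not escape double quotes inside names or patterns.
    """
    rules = [f'IsComplete "{column}"' for column in required]
    for column, regex in (patterns or {}).items():
        rules.append(f'ColumnValues "{column}" matches "{regex}"')
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


# Hypothetical columns, for illustration only.
print(build_ruleset(required=["customer_id"], patterns={"zip_code": "[0-9]{5}"}))
```

The resulting string can be pasted into the ruleset editor or passed as the `Ruleset` argument when creating a ruleset through the API.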
