1 Answer
There is no right number, since it depends on the data, the rules, and how long you are willing to wait. Sometimes adding more workers won't help at all if there is a bottleneck elsewhere.
In general, try to avoid CustomSql rules, since each one runs independently and scans the data in a separate read.
If you apply the same rules in a Glue job, you can use the Spark UI to inspect the execution and find where the bottleneck is.
Thank you, this helps already. Unfortunately, we need a bunch of `CustomSql` rules because we have many columns that can be `NULL` or must have a specific format. Would you say, as a rule of thumb, we should have one worker for each `CustomSql` rule and some extra for the other rules?

The general idea is that column rules are cheaper than table rules, so try to do that format check with a column rule. No, you cannot directly relate the number of rules to the number of workers; the volume of data has a bigger impact, so it's not a linear correlation.
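For the NULL and format checks described above, built-in DQDL column rules such as `IsComplete` and `ColumnValues ... matches` can usually stand in for a `CustomSql` scan. A sketch of such a ruleset follows; the column names and regex pattern are hypothetical, and the last rule shows the `CustomSql` equivalent of the first, which forces a separate read of the table:

```
Rules = [
    IsComplete "customer_id",
    ColumnValues "order_ref" matches "[A-Z]{3}-[0-9]{6}",
    CustomSql "SELECT count(*) FROM primary WHERE customer_id IS NULL" = 0
]
```

Here `primary` is the alias DQL uses for the dataset being evaluated. Preferring the first two column rules lets Glue evaluate them in the same pass over the data instead of issuing an extra query per rule.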