1 Answer
There is no single right number of workers; it depends on the data volume, the rules, and how long you are willing to wait. Sometimes adding more workers will not help at all if there is a bottleneck elsewhere.
In general, try to avoid CustomSql rules, since each one has to run independently and goes through the data in a separate read.
If you apply the same rules in a Glue job, you can use the Spark UI to inspect the execution and perhaps find where the bottleneck is.
Thank you, this helps already. Unfortunately, we need a bunch of CustomSql rules because we have many columns that can be NULL or must have a specific format. Would you say, as a rule of thumb, that we should have one worker for each CustomSql rule and some extra for the other rules?

The general idea is that column rules are cheaper than table rules, so try to do that format check with a column rule. No, you cannot directly relate the number of rules to the number of workers; the volume of data has a bigger impact, so it is not a linear correlation.
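To illustrate the suggestion above, NULL and format checks can often be expressed as column-level DQDL rules instead of CustomSql. A sketch of such a ruleset might look like the following (the column names and the exact thresholds are made up for illustration; the rule types are standard DQDL):

```
Rules = [
    # Completeness check instead of a CustomSql "... IS NOT NULL" query
    IsComplete "customer_id",

    # Column-level regex format check instead of a table-level CustomSql rule
    ColumnValues "order_code" matches "[A-Z]{3}-[0-9]{4}",

    # If some NULLs are acceptable, require a minimum completeness ratio
    Completeness "email" >= 0.95
]
```

Because these are column rules, Glue Data Quality can evaluate them together in one pass over the data, whereas each CustomSql rule triggers its own separate read.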