1 Answer
There is no right number, since it depends on the data, the rules, and how long you are willing to wait. Sometimes adding more workers won't help at all if there is a bottleneck elsewhere.
In general, try to avoid CustomSql rules, since each one runs independently and scans the data in a separate read.
If you apply the same rules in a Glue job, you can use the Spark UI to inspect the execution and find where the bottleneck is.
Thank you, this helps already. Unfortunately, we need a bunch of `CustomSql` rules because we have many columns that can be `NULL` or must have a specific format. Would you say, as a rule of thumb, we should have one worker for each `CustomSql` rule and some extra for the other rules?

The general idea is that column rules are cheaper than table rules, so try to do that format check with a column rule. No, you cannot directly relate the number of rules to the number of workers; the volume of data has a bigger impact, so it's not a linear correlation.
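For the NULL and format checks described above, built-in DQDL column rules such as `IsComplete` and `ColumnValues ... matches` can usually stand in for a `CustomSql` scan. A sketch of such a ruleset follows; the column names and regex pattern are hypothetical, and the last rule shows the `CustomSql` equivalent of the first, which forces a separate read of the table:

```
Rules = [
    IsComplete "customer_id",
    ColumnValues "order_ref" matches "[A-Z]{3}-[0-9]{6}",
    CustomSql "SELECT count(*) FROM primary WHERE customer_id IS NULL" = 0
]
```

Here `primary` is the alias DQL uses for the dataset being evaluated. Preferring the first two column rules lets Glue evaluate them in the same pass over the data instead of issuing an extra query per rule.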