When dealing with slow AWS Glue DataBrew jobs, there are several strategies you can employ to improve performance:
- Optimize resource allocation: DataBrew processes jobs using multiple compute nodes running in parallel. The default is 5 nodes, with a maximum of 149. Experiment with different node counts to find the optimal configuration for your job. In your case, increasing from 2 to 6 nodes showed some improvement.
- Review job settings: Check the "Advanced job settings" in your DataBrew job configuration. You can adjust the maximum number of units (compute nodes), the job timeout, and the number of retries. These settings can affect job performance.
- Input data source: Since you're reading from RDS, make sure the database connection is optimized, and consider parallel-read techniques if your database engine supports them.
- Output optimization: You're already using Parquet format with Snappy compression, which is generally good for performance. However, consider testing different compression methods, or no compression, to see if it affects your specific use case.
- Partitioning strategy: While partitioning can be beneficial for larger datasets, for small datasets like yours (20 records) it might introduce overhead. Try running the job without partitioning to see if it improves performance.
- Job profiling: Enable Amazon CloudWatch Logs for your DataBrew job. This will allow you to analyze the job's performance and identify potential bottlenecks in specific stages of the process.
- Data preprocessing: If possible, consider preprocessing your data in the source database. For example, you could create a view in your RDS database that includes only the filtered data, reducing the amount of data DataBrew needs to process.
- Caching: If you're running this job repeatedly with the same or similar data, check whether DataBrew offers any caching mechanism that could speed up subsequent runs.
- Job scheduling: If real-time processing isn't necessary, consider scheduling the job during off-peak hours when more resources might be available.
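The settings mentioned above can also be changed programmatically. Below is a minimal sketch that assembles the "Advanced job settings" fields and (optionally) applies them with boto3's `update_recipe_job`; the job name and role ARN are placeholder assumptions, not real resources:

```python
def build_job_settings(name, role_arn, max_capacity=6,
                       timeout_minutes=60, max_retries=0):
    """Assemble the 'Advanced job settings' fields for a DataBrew recipe job.

    MaxCapacity is the number of compute nodes (default 5, maximum 149);
    Timeout is in minutes.
    """
    return {
        "Name": name,
        "RoleArn": role_arn,
        "MaxCapacity": max_capacity,
        "Timeout": timeout_minutes,
        "MaxRetries": max_retries,
    }


def apply_job_settings(settings):
    """Apply the settings via the AWS SDK for Python (boto3).

    Requires AWS credentials, so it is shown here but not invoked.
    """
    import boto3

    databrew = boto3.client("databrew")
    databrew.update_recipe_job(**settings)
```

You would call `apply_job_settings(build_job_settings("my-databrew-job", "arn:aws:iam::123456789012:role/DataBrewRole", max_capacity=6))` to raise the node count, keeping in mind that more nodes also means higher cost.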
Remember that for very small datasets like yours (20 records), the overhead of setting up and tearing down the DataBrew environment might contribute significantly to the overall job duration. In such cases, you might want to consider whether DataBrew is the most appropriate tool for your use case, or whether a simpler solution would be more efficient for such small-scale operations.
If performance issues persist after trying these optimizations, it may be worth reaching out to AWS support for more detailed, environment-specific advice.
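As an example of such a simpler alternative for a 20-record dataset, a short pandas script can perform the same filter-and-export in-process. This is only a sketch: the connection string, query, and `status` column below are illustrative assumptions, not details from your setup:

```python
import pandas as pd


def filter_records(df, column, value):
    """In-process filter; for ~20 records this avoids DataBrew's
    environment setup/teardown overhead entirely."""
    return df[df[column] == value].reset_index(drop=True)


def export_from_rds(connection_uri, query, out_path):
    """Read from RDS, filter, and write Snappy-compressed Parquet.

    Requires SQLAlchemy and pyarrow; shown here but not invoked.
    'status'/'active' are placeholder filter values for illustration.
    """
    df = pd.read_sql(query, connection_uri)
    filtered = filter_records(df, "status", "active")
    filtered.to_parquet(out_path, compression="snappy", index=False)
    return len(filtered)
```

A script like this could run in seconds on a tiny table, with no cluster to provision.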
For slow-running Glue DataBrew jobs, you can improve performance by doing one or all of the following:
- Increase the maximum number of units in the job settings [1], as this increases the compute power and, depending on the dataset size, produces faster results. DataBrew processes jobs using multiple compute nodes running in parallel; the default number of nodes is 5 and the maximum is 149. The downside is that costs will increase accordingly.
- Optimize file sizes to around 128 MB for efficient performance. Reading many small files forces DataBrew to spend time opening each one, which can also lead to nodes running out of memory.
- Use a columnar file type such as Parquet. Thanks to the columnar format, DataBrew can pull data efficiently by column instead of reading every record, as it would with a row-based file type (CSV, JSON, etc.).
- Enable Amazon CloudWatch Logs for your DataBrew job. This will enable you to examine the job's performance and pinpoint potential problem areas within the different stages of the process.
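The ~128 MB file-size guidance above can be turned into a quick sizing helper. A minimal sketch (the target value comes from the bullet above; the helper itself is illustrative) estimates how many files a dataset should be compacted into:

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # ~128 MB per file, per the guidance above


def recommended_file_count(total_bytes, target_bytes=TARGET_FILE_BYTES):
    """Ceiling division: compact the dataset into the fewest files that
    keep each one at roughly the target size or below."""
    return max(1, -(-total_bytes // target_bytes))
```

For example, a 512 MB dataset would be compacted into 4 files rather than hundreds of small ones.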
You are currently using 6 nodes for your job, and you observed that decreasing the node count increased the runtime, while increasing it from 2 to 6 improved performance. You're also already using Parquet format with Snappy compression, which is generally good for performance.
Dividing the data into smaller parts (partitioning) can be helpful for larger datasets, but for your small dataset (20 records), it might actually slow things down instead of improving performance. Therefore, I would suggest trying to run the job without partitioning to see if that makes it run faster.
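That trade-off can be captured in a simple guard. In this sketch, the 100,000-record threshold is an assumed illustration, not an AWS recommendation:

```python
def should_partition(record_count, min_records=100_000):
    """Partitioning a tiny output (e.g. 20 records) just adds per-partition
    file overhead, so only partition once the dataset passes a size
    threshold (the value here is an assumption for illustration)."""
    return record_count >= min_records
```

With this heuristic, your 20-record dataset would be written unpartitioned.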
If performance issues persist after trying these optimizations, please raise a case with the AWS Support team. This will enable the team to conduct a thorough investigation and recommend environment-specific optimizations.
Link to raise a case: https://support.console.aws.amazon.com/support/home#/case/create
Thanks!
References:
https://docs.aws.amazon.com/databrew/latest/dg/jobs.recipe.html
