My Amazon Managed Streaming for Apache Kafka (Amazon MSK) Connect connectors show high MemoryUtilization, and I want to reduce the utilization.
Short description
The MemoryUtilization metric in Amazon MSK Connect measures the percentage of the total memory that's used on a worker instance, not only the Java virtual machine (JVM) heap memory that's currently in use. Memory utilization between 70% and 85% is normal operating behavior for MSK Connect connectors. The JVM heap grows from its initial size to accommodate workload demands, and memory utilization increases with it. Temporary spikes in memory utilization can occur during high-activity periods, such as database backups or large data transfers.
For more information, see Monitoring Amazon MSK Connect.
Resolution
Review your current memory utilization
Check whether your connector shows memory utilization of 90% or higher. Memory utilization higher than 90% might still be acceptable when the following conditions are true:
- The connector continues to process data normally.
- The ErrorTaskCount metric shows no error tasks.
- CPU utilization remains stable.
If you have memory utilization higher than 90% and the preceding conditions aren't true, then identify problematic memory patterns.
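The preceding check can be sketched as a small decision function. This is a minimal illustration, not an AWS API: the ConnectorSnapshot class, its field names, and the returned labels are hypothetical, and you supply the values from your own metrics and observations.

```python
from dataclasses import dataclass


@dataclass
class ConnectorSnapshot:
    """Hypothetical container for values that you read from your own monitoring."""
    memory_utilization: float  # MemoryUtilization, in percent (0-100)
    error_task_count: int      # ErrorTaskCount
    processing_normally: bool  # your own judgment of data flow
    cpu_stable: bool           # your own judgment of CPU utilization


def assess_memory(snapshot: ConnectorSnapshot) -> str:
    """Return a label for the review step described in this section."""
    if snapshot.memory_utilization < 90:
        return "normal"
    healthy = (
        snapshot.processing_normally
        and snapshot.error_task_count == 0
        and snapshot.cpu_stable
    )
    # High memory alone isn't a problem when the other conditions hold.
    return "acceptable" if healthy else "investigate"


print(assess_memory(ConnectorSnapshot(93.0, 0, True, True)))
print(assess_memory(ConnectorSnapshot(93.0, 2, True, True)))
```

A result of "investigate" corresponds to the case where you move on to identify problematic memory patterns.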
Identify problematic memory patterns
Monitor for the following memory patterns:
- Check whether your memory repeatedly spikes to 100% and triggers automatic restarts.
- Look for sharp decreases in memory utilization that are followed by the same pattern of increase.
- Check whether memory utilization fails to stabilize as expected at 80-90%.
If you experience the preceding patterns, then the connector is under-scaled for the current workload. Troubleshoot the memory issues.
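To make the spike-and-drop pattern concrete, the following sketch counts how often a series of MemoryUtilization samples reaches a peak and then falls sharply at the next sample. The function name and the peak and drop thresholds are assumptions chosen for illustration. Tune them to your own data.

```python
from typing import Sequence


def count_restart_like_cycles(
    samples: Sequence[float],
    peak: float = 99.0,   # assumed threshold for "memory hit about 100%"
    drop: float = 30.0,   # assumed size of a "sharp decrease", in percentage points
) -> int:
    """Count samples at or above `peak` that are followed by a fall of at least `drop`."""
    cycles = 0
    for previous, current in zip(samples, samples[1:]):
        if previous >= peak and previous - current >= drop:
            cycles += 1
    return cycles


# Hypothetical samples: two climbs to 100% that are each followed by a sharp drop.
sawtooth = [60, 80, 100, 35, 70, 100, 40, 75]
steady = [78, 82, 80, 84, 81]

print(count_restart_like_cycles(sawtooth))  # 2
print(count_restart_like_cycles(steady))    # 0
```

A count that keeps growing over time suggests the repeated restart pattern described in the preceding list. A count of zero with values in the expected range suggests stable behavior.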
Troubleshoot memory issues
Review CloudWatch logs for errors
Create a log group in Amazon CloudWatch Logs for your connector, and then review the logs for error patterns that contribute to memory issues. Configuration, permissions, or connectivity errors can cause memory to build up during error handling and retry attempts. Resolve any configuration or connectivity issues before you scale your resources. If memory issues persist after you resolve the errors, then take the following actions.
Increase MCU count or add more workers
The total capacity of a connector depends on the number of workers that the connector has and the number of MSK Connect Units (MCUs) per worker. Each MCU provides 1 vCPU of compute capacity and 4 GiB of memory. For more information, see Understand connector capacity.
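As a worked example of this arithmetic, the following sketch computes the capacity of a connector from its worker count and MCUs per worker. The function name and the returned keys are illustrative only.

```python
VCPU_PER_MCU = 1   # each MCU provides 1 vCPU
GIB_PER_MCU = 4    # each MCU provides 4 GiB of memory


def connector_capacity(workers: int, mcus_per_worker: int) -> dict:
    """Return per-worker and total capacity for a connector."""
    total_mcus = workers * mcus_per_worker
    return {
        "memory_gib_per_worker": mcus_per_worker * GIB_PER_MCU,
        "total_vcpus": total_mcus * VCPU_PER_MCU,
        "total_memory_gib": total_mcus * GIB_PER_MCU,
    }


# 2 workers with 2 MCUs each: 8 GiB per worker, 4 vCPUs and 16 GiB in total.
print(connector_capacity(workers=2, mcus_per_worker=2))
```

Because MemoryUtilization is reported for a worker instance, the per-worker figure is the one that changes when you raise the MCU count per worker. Adding workers raises only the totals.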
To increase the MCU count per worker to provide resources for your workload, complete the following steps:
- Open the Amazon MSK console.
- In the navigation pane, under MSK Connect, choose Connectors.
- Select your connector, and then choose Edit.
- In Capacity, increase the MCU count per worker or add more workers.
  - If you use provisioned capacity, then use the UpdateConnector API to increase the number of MCUs. You can also update the minimum and maximum worker count directly in the Amazon MSK console, under the settings of your connector.
  - If you use autoscaled capacity, then increase the maximum worker count to allow more scale-out capacity. Also, lower the scale-out percentage for CPU utilization so that workers are added at lower thresholds.
- Choose Save changes.
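If you prefer to script the capacity change, the following sketch builds the capacity structure for the UpdateConnector API and passes it to a client. It assumes the boto3 kafkaconnect client and its describe_connector and update_connector operations. The helper function names and the example values are hypothetical, so verify the request shape against the current API reference before you use it.

```python
def provisioned_capacity(mcu_count: int, worker_count: int) -> dict:
    """Build the capacity structure for a connector with provisioned capacity."""
    return {"provisionedCapacity": {"mcuCount": mcu_count, "workerCount": worker_count}}


def autoscaling_capacity(mcu_count: int, min_workers: int, max_workers: int,
                         scale_out_cpu: int, scale_in_cpu: int) -> dict:
    """Build the capacity structure for a connector with autoscaled capacity."""
    return {
        "autoScaling": {
            "mcuCount": mcu_count,
            "minWorkerCount": min_workers,
            "maxWorkerCount": max_workers,
            "scaleOutPolicy": {"cpuUtilizationPercentage": scale_out_cpu},
            "scaleInPolicy": {"cpuUtilizationPercentage": scale_in_cpu},
        }
    }


def apply_capacity(client, connector_arn: str, capacity: dict) -> dict:
    """Read the connector's current version, then submit the capacity update."""
    current = client.describe_connector(connectorArn=connector_arn)
    return client.update_connector(
        connectorArn=connector_arn,
        currentVersion=current["currentVersion"],
        capacity=capacity,
    )


# Example usage (requires boto3 and AWS credentials; the ARN is a placeholder):
# import boto3
# client = boto3.client("kafkaconnect")
# apply_capacity(client, "<your-connector-arn>", provisioned_capacity(2, 2))
```

The client is passed in as a parameter so that you can try the helper with a stub before you run it against a real connector.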
Adjust worker configuration
Increase the offset.flush.timeout.ms parameter in the worker configuration to allow more time for offset commits to complete. To reduce the amount of data that buffers in memory, decrease the producer.buffer.memory parameter.
To update the worker configurations, complete the following steps:
- Open the Amazon MSK console.
- In the navigation pane, under MSK Connect, choose Connectors.
- Select your connector, and then choose Edit.
- In Worker configuration, modify the offset.flush.timeout.ms and producer.buffer.memory parameters.
- Choose Save changes.
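As an illustration of the two parameter changes, the following sketch renders them in the key=value format that worker configurations use. The values are hypothetical starting points, not recommendations. The comments cite the Apache Kafka defaults for comparison, and a complete worker configuration contains other properties that aren't shown here.

```python
def render_worker_properties(overrides: dict) -> str:
    """Render property overrides as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())


overrides = {
    # Apache Kafka default is 5000 ms. A higher value allows more time for offset commits.
    "offset.flush.timeout.ms": 10_000,
    # Apache Kafka default is 33554432 bytes (32 MiB). A lower value buffers less data in memory.
    "producer.buffer.memory": 16 * 1024 * 1024,
}

print(render_worker_properties(overrides))
# offset.flush.timeout.ms=10000
# producer.buffer.memory=16777216
```

Test changes like these on a non-production connector first, because a smaller producer buffer can reduce throughput.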
Improve MSK cluster partition configuration
If possible, reduce the number of partitions in your configuration. Fewer partitions mean fewer independent flush operations, leading to less memory consumption. Verify that your partition count stays within recommended limits for your broker size.
Turn on automatic scaling
Automatic scaling provisions additional workers when CPU utilization exceeds the scale-out threshold. It scales workers back when CPU utilization drops below the scale-in threshold. To handle varying workloads dynamically, turn on automatic scaling for your connector, and then configure the thresholds. The default scale-out threshold is 80%, and the default scale-in threshold is 20%.
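The threshold behavior can be sketched as follows. This is a simplified model of the rule described in this section, not the service's actual implementation. The function name, the returned labels, and the worker-count bounds are assumptions for illustration.

```python
def scaling_action(cpu_percent: float, workers: int, min_workers: int, max_workers: int,
                   scale_out: float = 80.0, scale_in: float = 20.0) -> str:
    """Return the action that the thresholds imply for one CPU utilization reading."""
    if cpu_percent > scale_out and workers < max_workers:
        return "scale_out"
    if cpu_percent < scale_in and workers > min_workers:
        return "scale_in"
    return "hold"


print(scaling_action(85, workers=2, min_workers=1, max_workers=4))  # scale_out
print(scaling_action(10, workers=2, min_workers=1, max_workers=4))  # scale_in
print(scaling_action(95, workers=4, min_workers=1, max_workers=4))  # hold
```

In this model, only CPU utilization drives scaling. A connector with high memory utilization but moderate CPU utilization doesn't scale out, which is why a lower scale-out percentage or a higher MCU count can still be necessary.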
Before you turn on automatic scaling, verify that your connector type supports automatic scaling. Some connectors, such as the MongoDB source connector, don't support automatic scaling. If you turn on automatic scaling on unsupported connectors, then you might duplicate records.
Automatic restarts prevent system failure when memory reaches 100%. If you experience continuous memory increases and automatic restarts, then increase your memory allocation. JVM garbage collection reclaims unused memory, but it can't solve the underlying problem of insufficient resources.
Note: When you turn on automatic scaling, Amazon MSK Connect automatically adjusts the connector's tasks.max property based on current workers and MCUs per worker.
Related information
Understand connectors