This article explains how to identify and resolve performance degradation caused by CPU credit exhaustion in Amazon OpenSearch Service clusters that use burstable t2 or t3 instances.
Problem Description
Check if you are experiencing one or more of the following issues with your OpenSearch cluster:
- Cluster completely unresponsive - Unable to accept traffic or respond to queries
- Cluster status is yellow or red - Indicating shard allocation or availability issues
- Queries returning partial results - Incomplete data in query responses
- Increased query latency or timeouts - Queries taking significantly longer than normal
- Intermittent dashboard login failures - "Invalid username or password" errors when accessing OpenSearch Dashboards
Root Cause
When these symptoms occur, some or all nodes in your cluster are likely in an out-of-service state. If your cluster runs on only one or two t2 or t3 instances as data nodes, it is likely underscaled for its workload.
Both t2 and t3 instances are burstable performance instances that rely on CPU credits. A node spends credits whenever it runs above its baseline CPU level and earns them back at a fixed rate. When sustained high utilization exhausts the credit balance, CPU performance is limited to the baseline level, which causes the symptoms described above.
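The credit mechanism can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the usual EC2 definition that one CPU credit equals one vCPU running at 100% for one minute, and the example figures (2 vCPUs, 24 credits earned per hour, a maximum balance of 576) are assumptions taken from the published EC2 figures for t3.small, not values confirmed for your domain. The function name is hypothetical.

```python
def hours_until_credits_exhausted(balance, vcpus, utilization, earn_per_hour):
    """Estimate how many hours a node can sustain a load before its
    CPU credit balance reaches zero.

    Assumption: one CPU credit = one vCPU at 100% for one minute, so a
    node spends vcpus * utilization * 60 credits per hour.
    Returns None when the node earns at least as much as it spends.
    """
    spend_per_hour = vcpus * utilization * 60
    net_drain = spend_per_hour - earn_per_hour
    if net_drain <= 0:
        return None  # at or below baseline: the balance never drains
    return balance / net_drain


# Assumed t3.small-class figures: 2 vCPUs, 24 credits/hour, 576 credits max.
hours = hours_until_credits_exhausted(
    balance=576, vcpus=2, utilization=0.60, earn_per_hour=24
)
print(f"Credits last about {hours:.1f} hours at 60% average CPU")
```

Under these assumptions, a node that starts with a full balance and averages 60% CPU drains its credits in about half a day, while a node at 20% CPU (the assumed baseline) never drains them. This is why a cluster can run normally for hours and then degrade without any change in traffic.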
Resolution Steps
Immediate Actions
1. Reduce or stop traffic to the OpenSearch domain
Temporarily throttle or halt incoming traffic to allow burstable instances to accumulate CPU credits.
2. Upgrade data node instance types
After CPU credits recover, upgrade to a more appropriate instance type:
- If currently using t3.small, consider upgrading to t3.medium as a first step
- For production workloads, migrate to non-burstable instance types such as C, M, R, or I series instances
Moving to a larger or non-burstable instance type resolves most CPU credit-related issues seen with t2 or t3 instances.
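For step 1 above, if traffic cannot be stopped entirely, one possible approach is to cap the request rate in the client application. The sketch below uses a token bucket, which is a general rate-limiting technique and not an OpenSearch Service feature; the class name and the example rate are hypothetical, and the right limit depends on your workload.

```python
import time


class TokenBucket:
    """Client-side rate limiter: permits about `rate` requests per second
    on average, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.clock = clock               # injectable so tests can fake time
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Return True if a request may be sent now, False to defer it."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Example: allow roughly 5 requests per second (an arbitrary figure).
limiter = TokenBucket(rate=5, capacity=5)
if limiter.try_acquire():
    pass  # send the query to the domain here
else:
    pass  # skip, queue, or retry the query later
```

Deferring requests in the client rather than letting them time out on the domain keeps CPU utilization down, which gives the nodes a chance to earn credits back.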
Long-term Recommendations for Stable Operations
1. Scale to at least 3 data nodes
Deploy at least 3 data nodes across multiple Availability Zones for high availability and fault tolerance.
2. Add dedicated master nodes
Configure 3 dedicated master nodes to improve cluster stability and resilience. Dedicated master nodes handle cluster management tasks separately from data nodes, so cluster management does not compete with indexing and search for CPU.
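The recommendations above can be expressed as a single cluster configuration change. The following is a minimal sketch, assuming the boto3 OpenSearch Service client and its update_domain_config operation; the helper names, the domain name "my-domain", and the instance types r6g.large.search and m6g.large.search are placeholders chosen for illustration, not sizing advice. Applying the change requires boto3 and AWS credentials, so that part is kept in a separate function.

```python
def build_cluster_config(data_type, data_count=3, master_type=None, az_count=3):
    """Build a ClusterConfig dict with at least three data nodes spread
    across Availability Zones and, optionally, three dedicated masters."""
    if data_count < 3:
        raise ValueError("use at least 3 data nodes")
    config = {
        "InstanceType": data_type,
        "InstanceCount": data_count,
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": az_count},
    }
    if master_type is not None:
        config.update({
            "DedicatedMasterEnabled": True,
            "DedicatedMasterType": master_type,
            "DedicatedMasterCount": 3,
        })
    return config


def apply_cluster_config(domain_name, config):
    """Send the change to the service (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder runs without it
    client = boto3.client("opensearch")
    return client.update_domain_config(DomainName=domain_name,
                                       ClusterConfig=config)


# Placeholder instance types and domain name, for illustration only.
config = build_cluster_config("r6g.large.search",
                              master_type="m6g.large.search")
print(config)
# apply_cluster_config("my-domain", config)  # uncomment to apply
```

Keeping the configuration builder separate from the API call makes it easy to review the intended change before it is applied to the domain.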