This article provides practical guidance on how to safely decide whether to reduce the instance size (vertical scaling) or scale down the number of data nodes (horizontal scaling) in a 12-node Amazon OpenSearch Service cluster running r7g.4xlarge.search instances, with the goal of optimizing costs.
Yes, you can often reduce costs significantly with a 12-node r7g.4xlarge cluster, but thread pool metrics are among the most important indicators to watch before downsizing.
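An r7g.4xlarge.search node (Graviton3) has 16 vCPUs and 128 GiB of memory. OpenSearch Service assigns half of instance memory to the JVM heap, capped at 32 GiB, so the heap is about 32 GiB per node, not 128 GiB. While CPUUtilization and JVMMemoryPressure give a good overview, ThreadpoolSearchRejected, ThreadpoolWriteRejected, and the thread pool queue depths often reveal concurrency limits earlier than average CPU.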
Practical Guide to Safely Right-Sizing Your Cluster
1. Key Decision Checklist – Is Downsizing Safe?
Monitor these metrics for at least 7–14 days (including peak hours). You should only consider downsizing if most of the following are true:
| Metric | Safe Threshold for Downsizing | Do NOT Downsize If |
|---|---|---|
| ThreadpoolSearchRejected | Sum = 0 (no rejections) | Any consistent > 0 |
| ThreadpoolWriteRejected | Sum = 0 or very low | Increasing trend |
| ThreadpoolSearchQueue | Average < 200, Max well below 1000 (the default search queue size) | Average ≥ 400–500 |
| ThreadpoolWriteQueue | Average low (well below max) | Consistently high |
| CPUUtilization | Average < 60%, Peak < 70% | Peak > 80% |
| JVMMemoryPressure | Average < 65%, Peak < 75% | Peak > 80% |
| FreeStorageSpace | > 60–70% free | < 50% free |
| SearchLatency | Stable and within SLA | Rising trend |
Important Note:
If you see ThreadpoolSearchRejected increasing or SearchQueue frequently above 300–400, your cluster is already hitting concurrency limits. Downsizing (smaller instance or fewer nodes) will likely cause more 429 Too Many Requests errors and degraded performance — even if CPU looks acceptable.
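As a rough illustration, the "Do NOT Downsize If" column can be turned into a small check. This is a minimal sketch: the dictionary keys are hypothetical names for summaries you would compute yourself from 7–14 days of CloudWatch data, and only the thresholds come from the table above.

```python
# Thresholds copied from the "Do NOT Downsize If" column of the checklist.
# The dictionary keys are hypothetical names, not CloudWatch identifiers.
def downsizing_blockers(m):
    """Return the blocking conditions that the summarized metrics trigger."""
    blockers = []
    if m["search_rejected_sum"] > 0:
        blockers.append("ThreadpoolSearchRejected > 0")
    if m["search_queue_avg"] >= 400:
        blockers.append("ThreadpoolSearchQueue average >= 400")
    if m["cpu_peak_pct"] > 80:
        blockers.append("CPUUtilization peak > 80%")
    if m["jvm_peak_pct"] > 80:
        blockers.append("JVMMemoryPressure peak > 80%")
    if m["free_storage_pct"] < 50:
        blockers.append("FreeStorageSpace < 50% free")
    return blockers


busy = {
    "search_rejected_sum": 12,
    "search_queue_avg": 450,
    "cpu_peak_pct": 62,
    "jvm_peak_pct": 71,
    "free_storage_pct": 66,
}
print(downsizing_blockers(busy))
```

An empty list only means no hard blocker was hit. The values still need to fall inside the "Safe Threshold" column before downsizing is worth testing.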
2. Why Threadpool Metrics Are Critical for Your 12-Node r7g.4xlarge Cluster
r7g.4xlarge provides 16 vCPUs per node. This directly limits the number of concurrent search and write threads.
High SearchQueue or SearchRejected means the node doesn’t have enough threads/vCPUs to handle your query concurrency, even if average CPU utilization is moderate.
These metrics are especially important for search-heavy, aggregation-heavy, or bursty workloads.
Thread pool rejections are one of the earliest warning signs that downsizing would hurt performance.
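To make the link between vCPUs and threads concrete, the sketch below uses the upstream OpenSearch default pool sizes: search threads = int(vCPUs × 3 / 2) + 1 and write threads = vCPUs. It is an assumption that your domain uses these defaults, so confirm the real sizes with `GET _cat/thread_pool?v&h=node_name,name,size` before relying on the numbers.

```python
def search_threads(vcpus):
    # Upstream default for the search pool (assumed to apply to your domain).
    return (vcpus * 3) // 2 + 1


def write_threads(vcpus):
    # Upstream default for the write pool: one thread per allocated processor.
    return vcpus


layouts = [("12 x r7g.4xlarge", 12, 16), ("9 x r7g.2xlarge", 9, 8)]
for label, nodes, vcpus in layouts:
    print(label, nodes * search_threads(vcpus), nodes * write_threads(vcpus))
# 12 x r7g.4xlarge 300 192
# 9 x r7g.2xlarge 117 72
```

Under these assumptions, moving from 12 × r7g.4xlarge to 9 × r7g.2xlarge cuts cluster-wide search threads from 300 to 117, which is why queue depth and rejections can rise sharply even when average CPU looked moderate beforehand.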
How to Check Thread Pool Metrics:
- In CloudWatch: look for metrics whose names start with ThreadpoolSearch and ThreadpoolWrite.
- In OpenSearch Dashboards (Dev Tools console), run:
GET _cat/thread_pool?v
Focus on the search and write pools, and check the active, queue, and rejected columns.
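If you save the `_cat/thread_pool?v` output as text, a short script can flag the nodes and pools under pressure. This is a sketch under assumptions: it expects the default columns (node_name, name, active, queue, rejected), assumes node names contain no spaces, and uses a hypothetical `queue_limit` of 300 taken from the lower bound in the note above. The sample rows are invented for illustration.

```python
def nodes_with_pressure(cat_output, queue_limit=300):
    """Return (node, pool, queue, rejected) for rows that show pressure."""
    lines = [ln.split() for ln in cat_output.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    col = {name: i for i, name in enumerate(header)}  # column name -> index
    flagged = []
    for row in rows:
        queue = int(row[col["queue"]])
        rejected = int(row[col["rejected"]])
        if rejected > 0 or queue > queue_limit:
            flagged.append((row[col["node_name"]], row[col["name"]], queue, rejected))
    return flagged


# Invented sample output, for illustration only.
sample = """
node_name name   active queue rejected
node-1    search     25   412       37
node-1    write       3     0        0
node-2    search     11     0        0
"""
print(nodes_with_pressure(sample))
# [('node-1', 'search', 412, 37)]
```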
3. Other Key Factors to Consider
- Shard Strategy:
  - With 12 nodes, aim for a primary shard count that is a multiple of 6 or 12 for even distribution.
  - Keep individual shard size between 10–30 GB (search workloads) or 30–50 GB (log workloads).
- Workload Type:
  - r7g instances (Graviton3) offer excellent memory and compute efficiency.
  - If your workload is very CPU-intensive, newer r8g instances may provide even better price/performance.
- High Availability:
  - Keep at least 3 Availability Zones.
  - Reducing from 12 to 9 nodes is generally safer than going down to 6.
- Dedicated Cluster Manager Nodes:
  - Recommended for clusters with more than 10 data nodes: keep 3 dedicated manager nodes, sized for your data node and shard count. A .large type such as c7g.large.search may be too small at this scale, so check the AWS sizing guidance.
- Growth Headroom:
  - Maintain at least 20–25% buffer for traffic spikes and data growth.
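The shard guidance above reduces to simple arithmetic. The sketch below is illustrative only: the function name, the 600 GB index size, and the 30 GB target are hypothetical inputs, and rounding up to a multiple of 6 follows the distribution advice in the list.

```python
import math


def primary_shards(index_size_gb, target_shard_gb, multiple_of):
    """Smallest shard count that keeps shards at or under the target size
    and is a multiple of `multiple_of` for even distribution."""
    raw = math.ceil(index_size_gb / target_shard_gb)
    return math.ceil(raw / multiple_of) * multiple_of


shards = primary_shards(600, 30, 6)
print(shards, 600 / shards)
# 24 25.0
```

Rounding up rather than down keeps each shard at or below the target size (25 GB here), at the cost of a few extra shards.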
4. Step-by-Step Process to Right-Size Safely
Step 1: Analyze Current Usage
Check CloudWatch, then run these queries in the OpenSearch Dashboards Dev Tools console:
GET _cat/thread_pool?v
GET _cat/shards?v
GET _cat/allocation?v
GET _cat/indices?v&s=store.size:desc
These queries help identify bottlenecks. Document the current state during the analysis period, before making any changes.
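For the CloudWatch side of this analysis, the sketch below builds the arguments for a 14-day, hourly query. It assumes OpenSearch Service metrics are published in the `AWS/ES` namespace with `DomainName` and `ClientId` (your AWS account ID) dimensions. The function name, `my-domain`, and the account ID are placeholders. It only builds the request so that it runs without credentials; the commented line shows how it could be passed to boto3, assuming boto3 is installed.

```python
from datetime import datetime, timedelta, timezone


def metric_query(domain, account_id, metric, stat, days=14, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ES",
        "MetricName": metric,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain},
            {"Name": "ClientId", "Value": account_id},
        ],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": 3600,  # one datapoint per hour
        "Statistics": [stat],
    }


query = metric_query("my-domain", "123456789012", "ThreadpoolSearchQueue", "Maximum")
# With boto3 installed and credentials configured (an assumption):
# boto3.client("cloudwatch").get_metric_statistics(**query)
print(query["MetricName"], query["Period"])
```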
Step 2: Test in a Non-Production Environment
- Take a snapshot of your production domain.
- Restore it to a test domain.
- Reduce instance type or node count.
- Replay your actual query load and monitor threadpool metrics + latency.
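One way to judge the replay is to compare 95th-percentile latency between the production-sized baseline and the downsized test domain. This is a minimal sketch: the function names are hypothetical, the latency lists stand in for measurements you collect yourself, and the nearest-rank percentile is just one reasonable choice.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def latency_regression_pct(baseline_ms, candidate_ms):
    """Percent change in p95 latency from the baseline to the candidate."""
    base = p95(baseline_ms)
    return (p95(candidate_ms) - base) / base * 100


# Invented measurements, for illustration only.
baseline = [100] * 20
candidate = [120] * 20
print(latency_regression_pct(baseline, candidate))
# 20.0
```

What counts as an acceptable regression depends on your SLA. The comparison is only meaningful if both runs replay the same queries at the same concurrency.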
Step 3: Apply the Change
Example using the AWS CLI (the subcommand is update-domain-config; my-domain is a placeholder):
aws opensearch update-domain-config \
  --domain-name my-domain \
  --cluster-config InstanceType=r7g.2xlarge.search,InstanceCount=9
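Before applying a change of this size, a back-of-the-envelope projection helps. The example above goes from 12 × 16 = 192 vCPUs to 9 × 8 = 72 vCPUs. The sketch below assumes CPU demand stays constant and spreads linearly over the remaining vCPUs. That is a simplification, and the function name is hypothetical.

```python
def projected_peak_cpu(peak_pct, cur_nodes, cur_vcpus, new_nodes, new_vcpus):
    """Scale the observed peak CPU by the ratio of total vCPUs (linear model)."""
    return peak_pct * (cur_nodes * cur_vcpus) / (new_nodes * new_vcpus)


# A 30% peak on 12 x r7g.4xlarge projected onto 9 x r7g.2xlarge
print(projected_peak_cpu(30, 12, 16, 9, 8))
# 80.0
```

Under this model, a cluster peaking at only 30% CPU today would land at about 80% after the change, which is at the checklist's "Do NOT Downsize" boundary. Changing one dimension at a time (instance size or node count) makes the effect easier to attribute.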
Step 4: Monitor After Change
Pay special attention to:
- ThreadpoolSearchRejected
- ThreadpoolSearchQueue
- SearchLatency
- JVMMemoryPressure
If rejections appear or queues increase after downsizing, scale back up immediately.
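The rollback rule can be written down ahead of time so the decision is not made under pressure. This is a minimal sketch: the dictionary keys are hypothetical summaries over comparable time windows, and the 1.25 queue tolerance is an assumed allowance for normal variation, not a value from AWS.

```python
def should_scale_back_up(before, after, queue_tolerance=1.25):
    """True if rejections grew or the average search queue rose noticeably."""
    new_rejections = after["search_rejected"] > before["search_rejected"]
    queue_growth = (
        after["search_queue_avg"] > before["search_queue_avg"] * queue_tolerance
    )
    return new_rejections or queue_growth


# Invented summaries, for illustration only.
before = {"search_rejected": 0, "search_queue_avg": 40}
after = {"search_rejected": 0, "search_queue_avg": 90}
print(should_scale_back_up(before, after))
# True
```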