How to Decide Whether to Reduce r7g.4xlarge Instance Size or Scale Down a 12-Node OpenSearch Cluster to Save Costs

4 minute read
Content level: Advanced

This article provides practical guidance on how to safely decide whether to reduce the instance size (vertical scaling) or scale down the number of data nodes (horizontal scaling) in a 12-node Amazon OpenSearch Service cluster running r7g.4xlarge.search instances, with the goal of optimizing costs.

You can often cut costs significantly on a 12-node r7g.4xlarge cluster, but thread pool metrics are among the most important indicators to check before downsizing. Each r7g.4xlarge.search node (Graviton3) has 16 vCPUs and 128 GiB of memory, with the JVM heap capped at about 32 GiB (OpenSearch Service assigns half of an instance's RAM to the heap, up to that limit). CPUUtilization and JVMMemoryPressure give a good overview, but ThreadpoolSearchRejected, ThreadpoolWriteRejected, and queue depth often reveal concurrency limits earlier than average CPU does.

Practical Guide to Safely Right-Sizing Your Cluster

1. Key Decision Checklist – Is Downsizing Safe?

Monitor these metrics for at least 7–14 days (including peak hours). You should only consider downsizing if most of the following are true:

Metric                   | Safe Threshold for Downsizing  | Do NOT Downsize If
ThreadpoolSearchRejected | Sum = 0 (no rejections)        | Any consistent > 0
ThreadpoolWriteRejected  | Sum = 0 or very low            | Increasing trend
ThreadpoolSearchQueue    | Average < 200, Max << 1000     | Average ≥ 400–500
ThreadpoolWriteQueue     | Average low (well below max)   | Consistently high
CPUUtilization           | Average < 60%, Peak < 70%      | Peak > 80%
JVMMemoryPressure        | Average < 65%, Peak < 75%      | Peak > 80%
FreeStorageSpace         | > 60–70% free                  | < 50% free
SearchLatency            | Stable and within SLA          | Rising trend

Important Note: If you see ThreadpoolSearchRejected increasing or SearchQueue frequently above 300–400, your cluster is already hitting concurrency limits. Downsizing (smaller instance or fewer nodes) will likely cause more 429 Too Many Requests errors and degraded performance — even if CPU looks acceptable.
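The checklist can be turned into a small script so the decision is applied the same way every time. The following is a minimal sketch, not an official tool: the `MetricSummary` class and both function names are hypothetical, it covers only the rows with clear numeric thresholds, and it assumes you have already summarized 7–14 days of CloudWatch data into the fields shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricSummary:
    """Summary of one observation window (hypothetical structure, values supplied by you)."""
    search_rejected_sum: float
    search_queue_avg: float
    search_queue_max: float
    cpu_avg: float    # percent
    cpu_peak: float   # percent
    jvm_avg: float    # percent
    jvm_peak: float   # percent


def downsizing_blockers(m: MetricSummary) -> List[str]:
    """Return the 'Do NOT Downsize If' rows that are triggered (empty list = none)."""
    checks = [
        (m.search_rejected_sum > 0, "ThreadpoolSearchRejected: rejections observed"),
        (m.search_queue_avg >= 400, "ThreadpoolSearchQueue: average >= 400"),
        (m.cpu_peak > 80, "CPUUtilization: peak > 80%"),
        (m.jvm_peak > 80, "JVMMemoryPressure: peak > 80%"),
    ]
    return [message for triggered, message in checks if triggered]


def within_safe_thresholds(m: MetricSummary) -> bool:
    """True only if every 'Safe Threshold' row covered here is met.

    'Max << 1000' is simplified to 'max < 1000'; tighten it for your workload.
    """
    return (
        m.search_rejected_sum == 0
        and m.search_queue_avg < 200
        and m.search_queue_max < 1000
        and m.cpu_avg < 60 and m.cpu_peak < 70
        and m.jvm_avg < 65 and m.jvm_peak < 75
    )
```

A window that triggers no blockers but also misses the safe thresholds falls in a gray zone; in that case, testing on a restored copy of the domain is the safer path.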

2. Why Threadpool Metrics Are Critical for Your 12-Node r7g.4xlarge Cluster

r7g.4xlarge provides 16 vCPUs per node. This directly limits the number of concurrent search and write threads. High SearchQueue or SearchRejected means the node doesn’t have enough threads/vCPUs to handle your query concurrency, even if average CPU utilization is moderate. These metrics are especially important for search-heavy, aggregation-heavy, or bursty workloads. Thread pool rejections are one of the earliest warning signs that downsizing would hurt performance.
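To make the link between vCPUs and concurrency concrete, here is a rough calculation based on the upstream OpenSearch default thread pool sizes (search pool: `int(vCPUs * 3 / 2) + 1`; write pool: one thread per vCPU). Whether your managed domain uses exactly these defaults is an assumption; compare against the size column reported by `_cat/thread_pool` on your own cluster.

```python
def default_search_threads(vcpus: int) -> int:
    """Upstream OpenSearch default search pool size: int(vcpus * 3 / 2) + 1."""
    return (vcpus * 3) // 2 + 1


def default_write_threads(vcpus: int) -> int:
    """Upstream OpenSearch default write pool size: one thread per allocated processor."""
    return vcpus


def cluster_search_threads(nodes: int, vcpus_per_node: int) -> int:
    """Total search threads across all data nodes."""
    return nodes * default_search_threads(vcpus_per_node)


# Candidate layouts: r7g.4xlarge has 16 vCPUs, r7g.2xlarge has 8.
for label, nodes, vcpus in [
    ("12 x r7g.4xlarge", 12, 16),
    ("9 x r7g.4xlarge", 9, 16),
    ("12 x r7g.2xlarge", 12, 8),
]:
    print(label, "->", cluster_search_threads(nodes, vcpus), "search threads")
```

Under these assumed defaults, halving the instance size removes far more search threads than dropping from 12 to 9 nodes, which is why queue and rejection metrics deserve a close look before a vertical downsize.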

How to Check Thread Pool Metrics:

  • In CloudWatch: Look for metrics starting with ThreadpoolSearch and ThreadpoolWrite
  • In OpenSearch Dashboards (Dev Tools), run the following and focus on the search and write pools, checking the queue, active, and rejected columns:
GET _cat/thread_pool?v
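If you save the `_cat/thread_pool?v` output as text, a few lines of Python can pick out the nodes that show queuing or rejections. This is a sketch under assumptions: the function names are hypothetical, the sample output is made up for illustration, and it assumes the node_name, name, queue, and rejected columns are present and that node names contain no spaces.

```python
def parse_cat_thread_pool(text: str) -> list:
    """Parse whitespace-aligned _cat output (with ?v header) into a list of dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def flag_pressure(rows: list, pools=("search", "write")) -> list:
    """Return (node, pool, queue, rejected) for pools with a non-empty queue or rejections."""
    flagged = []
    for row in rows:
        if row["name"] not in pools:
            continue
        queue, rejected = int(row["queue"]), int(row["rejected"])
        if queue > 0 or rejected > 0:
            flagged.append((row["node_name"], row["name"], queue, rejected))
    return flagged


# Made-up sample output, for illustration only.
SAMPLE = """
node_name name    active queue rejected
node-a    search       4     0        0
node-a    write        1     0        0
node-b    search      25   310       17
node-b    generic      0     0        0
"""

print(flag_pressure(parse_cat_thread_pool(SAMPLE)))
```

A single snapshot only shows one moment, so run the check several times during peak hours and compare the rejected values between runs.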

3. Other Key Factors to Consider

  • Shard Strategy:
    • With 12 nodes, aim for primary shard count as a multiple of 6 or 12 for even distribution.
    • Keep individual shard size between 10–30 GB (search workloads) or 30–50 GB (log workloads).
  • Workload Type:
    • r7g instances (Graviton3) offer excellent memory and compute efficiency.
    • If your workload is very CPU-intensive, newer r8g instances may provide even better price/performance.
  • High Availability:
    • Keep the domain spread across 3 Availability Zones, and keep the data node count a multiple of 3 so nodes divide evenly across zones.
    • Reducing from 12 to 9 nodes is generally safer than going down to 6.
  • Dedicated Cluster Manager Nodes:
    • Recommended for clusters with >10 data nodes — keep 3 dedicated manager nodes (e.g., c7g.large.search).
  • Growth Headroom:
    • Maintain at least 20–25% buffer for traffic spikes and data growth.
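The shard guidance above comes down to simple arithmetic: divide the primary data volume by a target shard size, then round up to a multiple that spreads evenly over the nodes. A minimal sketch follows; the function name and the 900 GB figure are hypothetical and only illustrate the calculation.

```python
import math


def plan_primary_shards(total_primary_gb: float, target_shard_gb: float, multiple_of: int):
    """Return (primary shard count, resulting shard size in GB).

    The count is rounded up to the next multiple of `multiple_of`
    (for example 6 or 12 on a 12-node cluster).
    """
    raw_count = math.ceil(total_primary_gb / target_shard_gb)
    shard_count = math.ceil(raw_count / multiple_of) * multiple_of
    return shard_count, total_primary_gb / shard_count


# Hypothetical index: 900 GB of primary data, 30 GB target shard size.
print(plan_primary_shards(900, 30, 12))  # rounds 30 shards up to 36
print(plan_primary_shards(900, 30, 6))   # 30 is already a multiple of 6
```

If you plan to move to 9 nodes, rerun the calculation with a multiple that fits the new node count, since a shard count tuned for 12 nodes may distribute unevenly afterwards.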

4. Step-by-Step Process to Right-Size Safely

Step 1: Analyze Current Usage

Check CloudWatch, then run these queries in OpenSearch Dashboards Dev Tools:

GET _cat/thread_pool?v
GET _cat/shards?v
GET _cat/allocation?v
GET _cat/indices?v&s=store.size:desc

These queries help identify bottlenecks. Document the current state before making any changes.

Step 2: Test in a Non-Production Environment

  • Take a snapshot of your production domain.
  • Restore it to a test domain.
  • Reduce instance type or node count.
  • Replay your actual query load and monitor threadpool metrics + latency.

Step 3: Apply the Change

Example using AWS CLI:

bash

# update-domain-config is the AWS CLI operation that changes a domain's cluster configuration
aws opensearch update-domain-config \
  --domain-name my-domain \
  --cluster-config InstanceType=r7g.2xlarge.search,InstanceCount=9
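If you manage domains from Python, the change can also be sent through boto3's `update_domain_config` call on the `opensearch` client. The sketch below is illustrative: the two helper functions are hypothetical wrappers, the domain name is the placeholder from the example above, and the request defaults to `DryRun=True` so that, as I understand the API, the service validates the change without applying it. Check the parameter names against the boto3 version you have installed.

```python
def build_resize_request(domain_name: str, instance_type: str,
                         instance_count: int, dry_run: bool = True) -> dict:
    """Build keyword arguments for the boto3 OpenSearch update_domain_config call."""
    return {
        "DomainName": domain_name,
        "ClusterConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
        # Defaults to a dry run so nothing changes until you opt in.
        "DryRun": dry_run,
    }


def apply_resize(client, request: dict) -> dict:
    """Send the request with an OpenSearch client (or a stub in tests)."""
    return client.update_domain_config(**request)


# Usage against a real domain (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("opensearch")
#   request = build_resize_request("my-domain", "r7g.2xlarge.search", 9)
#   print(apply_resize(client, request))
```

Changing the instance type and the node count in separate steps makes it easier to tell which change caused any regression you see afterwards.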

Step 4: Monitor After Change

Pay special attention to:

  • ThreadpoolSearchRejected
  • ThreadpoolSearchQueue
  • SearchLatency
  • JVMMemoryPressure

If rejections appear or queues increase after downsizing, scale back up immediately.
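The rollback rule can be written down ahead of time so nobody has to improvise during an incident. The following is a sketch only: the function name and the 1.5x queue tolerance are hypothetical, and it assumes you pass in per-period rejection counts and queue-depth samples collected after the change, plus the average queue depth from your pre-change baseline.

```python
from statistics import mean


def should_scale_back(baseline_queue_avg: float,
                      post_queue_samples: list,
                      post_rejections_per_period: list,
                      queue_growth_tolerance: float = 1.5) -> bool:
    """Return True if post-change metrics suggest reverting the downsize.

    Triggers when any period shows rejections, or when the average search
    queue depth exceeds the baseline by more than the tolerance factor
    (1.5x is an illustrative value, not a recommendation).
    """
    if any(count > 0 for count in post_rejections_per_period):
        return True
    if not post_queue_samples:
        return False  # no data yet, so nothing to act on
    return mean(post_queue_samples) > baseline_queue_avg * queue_growth_tolerance


# Hypothetical samples taken after the change.
print(should_scale_back(50, [40, 55, 60], [0, 0, 0]))
print(should_scale_back(50, [40, 55, 60], [0, 3, 0]))
```

Pick the tolerance before the change, based on how much latency headroom your SLA leaves, and not after you see the new numbers.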

AWS
EXPERT
published a month ago