This article explains how to choose a sharding strategy for an Amazon OpenSearch Service cluster with 3 data nodes. It covers, in simple and practical terms, the key factors to consider, especially index size, workload type, and high availability.
Sharding Strategy Guide for a 3-Data-Node OpenSearch Cluster
Choosing the right sharding strategy is critical for performance, stability, and fault tolerance in Amazon OpenSearch Service. This guide walks through that decision for a 3-data-node cluster.
Understanding Sharding Notation
First, let's clarify the notation to avoid confusion:
| Strategy | Primary Shards | Replicas per Primary | Total Shards | Calculation |
|---|---|---|---|---|
| 1:1 | 1 | 1 | 2 | 1 primary + 1 replica = 2 |
| 3:1 | 3 | 1 | 6 | 3 primaries + 3 replicas = 6 |
| 6:1 | 6 | 1 | 12 | 6 primaries + 6 replicas = 12 |
Important: The notation "3:1" means 3 primary shards with 1 replica each, resulting in 6 total shards (not 9). Each primary shard gets exactly one replica copy.
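The arithmetic behind the notation can be sketched in a few lines of Python. The helper name `total_shards` is hypothetical and exists only for illustration; it assumes the "primaries:replicas" reading described above.

```python
def total_shards(notation: str) -> int:
    """Return the total number of shard copies for a 'primaries:replicas' string."""
    primaries, replicas = (int(part) for part in notation.split(":"))
    # Each primary has `replicas` copies, so total = primaries * (1 + replicas).
    return primaries * (1 + replicas)


for strategy in ("1:1", "3:1", "6:1"):
    print(strategy, "->", total_shards(strategy))
```

Note that "3:1" yields 3 × (1 + 1) = 6, not 3 × 3 = 9.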
Recommended Sharding Strategy for 3 Data Nodes
A general guideline is to try to keep shard size between 10–30 GiB for workloads where search latency is a key performance objective, and 30–50 GiB for write-heavy workloads such as log analytics.
| Index Size | Recommended Strategy | Total Shards | Use Case |
|---|---|---|---|
| <10 GB | 1:1 | 2 | Test environments or very small production indexes |
| 10–150 GB | 3:1 | 6 | Most production workloads (recommended) |
| >150 GB | 6:1 | 12 | Large indexes with high search/query load |
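As a rough illustration, the size bands in the table can be written as a small lookup. The function name `recommend_strategy` is an assumption made for this sketch, and the thresholds simply mirror the table; they are starting points rather than hard limits.

```python
def recommend_strategy(index_size_gb: float) -> str:
    """Map an index size in GB to the strategy suggested in the table above."""
    if index_size_gb < 10:
        return "1:1"   # test environments or very small indexes
    if index_size_gb <= 150:
        return "3:1"   # most production workloads
    return "6:1"       # large indexes with high search/query load


for size in (5, 60, 240):
    print(f"{size} GB -> {recommend_strategy(size)}")
```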
Key Factors to Consider
1. Fault Tolerance (Most Critical)
With only three data nodes, you must have at least one replica for production workloads.
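- Why? Without replicas, a single node failure takes that node's primary shards offline: the cluster goes red and the data on those shards is unavailable or lost.
- With one replica: If one node fails, replica copies on the surviving nodes take over and the cluster turns yellow (fully functional, degraded redundancy).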
- Golden Rule: Never use zero replicas in production on a three-node cluster.
- Recommendation: Start with a minimum of 3:1 to ensure every shard has a backup copy.
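A minimal sketch of the index settings for a 3:1 layout follows. `number_of_shards` and `number_of_replicas` are the standard index settings; the helper `index_settings` and its replica check are assumptions added here to illustrate the golden rule. The printed JSON is the kind of body one would send when creating an index (for example with `PUT <index-name>`, where the index name is a placeholder).

```python
import json


def index_settings(primaries: int = 3, replicas: int = 1) -> dict:
    """Build a create-index settings body, refusing zero replicas."""
    if replicas < 1:
        # Mirrors the golden rule above: no zero-replica indexes in production.
        raise ValueError("use at least one replica in production")
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,
                "number_of_replicas": replicas,
            }
        }
    }


print(json.dumps(index_settings(), indent=2))
```

The number of primary shards is fixed when the index is created, whereas the replica count can be changed later, so the primary count deserves the most thought up front.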
2. Target Shard Size
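AWS and OpenSearch best practices recommend keeping primary shards between roughly 10 and 50 GB: toward the lower end for search-heavy workloads, toward the upper end for write-heavy ones, and not above about 50 GB.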
Formula to Calculate Primary Shards:
Number of primary shards ≈ Total index size ÷ Target shard size
Examples:
| Total Index Size | Target Shard Size | Calculated Primary Shards | Recommended Strategy |
|---|---|---|---|
| 5 GB | 20–50 GB | 1 | 1:1 (2 total shards) |
| 60 GB | 30 GB | 2 | 3:1 (6 total shards; rounded up to a multiple of 3 nodes) |
| 120 GB | 40 GB | 3 | 3:1 (6 total shards) |
| 240 GB | 40 GB | 6 | 6:1 (12 total shards) |
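The formula and the examples in the table can be reproduced with a short sketch. The function name `primary_shards` is hypothetical, and the rounding rule (keep a single shard for very small indexes, otherwise round up to a multiple of the node count) is an assumption inferred from the table rows rather than an official algorithm.

```python
import math


def primary_shards(index_size_gb: float, target_shard_gb: float,
                   data_nodes: int = 3) -> int:
    """Estimate primary shards: size / target, rounded for even distribution."""
    raw = max(1, math.ceil(index_size_gb / target_shard_gb))
    if raw == 1:
        return 1  # a very small index stays on a single primary (1:1)
    # Round up to the next multiple of the data-node count.
    return math.ceil(raw / data_nodes) * data_nodes


for size, target in ((5, 50), (60, 30), (120, 40), (240, 40)):
    print(f"{size} GB at {target} GB/shard -> {primary_shards(size, target)} primaries")
```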
Why shard size matters:
- Too large (>50 GB): Slow recovery, rebalancing, and search performance issues.
- Too small (<10 GB): Excessive overhead, increased memory usage, larger cluster state.
3. Workload Type
| Workload Characteristic | Recommended Approach | Reasoning |
|---|---|---|
| Search-heavy / High QPS | Lean toward 6:1 | More shards = better query parallelism |
| Write-heavy / Logging | Prefer 3:1 | Fewer shards = less indexing overhead |
| Balanced read/write | Start with 3:1 | Safe, efficient starting point |
4. Even Distribution & Cluster Health
- Best Practice: Number of primary shards should be a multiple of your data nodes (e.g., 3) to ensure even distribution
- Good: 3, 6, 9, 12 primary shards
- Avoid: 4, 5, 7 primary shards (creates uneven distribution)
- Shard Limits: Keep total shards per node well below 1,000 for stability; the default value of the `cluster.max_shards_per_node` setting in OpenSearch is 1,000
- Rule of thumb: ~1.5 vCPU per active shard
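To see why multiples of the node count distribute evenly, the sketch below spreads all shard copies across the nodes as evenly as arithmetic allows. The helper `copies_per_node` is hypothetical, and it only counts copies; the real allocator also weighs factors such as disk usage, so treat this as an approximation.

```python
def copies_per_node(primaries: int, replicas: int = 1, data_nodes: int = 3) -> list:
    """Return the most even possible count of shard copies on each node."""
    total = primaries * (1 + replicas)
    base, extra = divmod(total, data_nodes)
    # `extra` nodes carry one more copy than the rest.
    return [base + 1] * extra + [base] * (data_nodes - extra)


for primaries in (3, 4, 5, 6):
    layout = copies_per_node(primaries)
    status = "even" if len(set(layout)) == 1 else "uneven"
    print(f"{primaries} primaries -> {layout} ({status})")
```

With 3 or 6 primaries every node holds the same number of copies, while 4 or 5 primaries leave at least one node carrying more than the others.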
5. Future Growth Considerations
If you expect your index to grow significantly:
- Start with 3:1 even if current data is small (<15 GB)
- Avoids costly reindexing later as data grows
- Provides headroom for growth without immediate reindexing
6. Monitoring and Adjusting
After implementing your sharding strategy, monitor these metrics:
- Shard size: Use `GET _cat/shards?v` to check actual shard sizes
- Search latency: High latency may indicate need for more shards (move to 6:1)
- Indexing throughput: Low throughput may indicate too many shards (consider 3:1)
- Cluster health: Yellow/red status requires immediate attention
- Node CPU/memory: High utilization may indicate over-sharding
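The shard-size check can be automated along these lines. The sketch assumes the CAT shards API is called with `format=json&bytes=b` so that sizes arrive as byte counts in string form; the sample payload, the index name `logs-sample`, and the helper `oversized_primaries` are all hypothetical and only illustrate the shape of the check.

```python
import json

MAX_SHARD_BYTES = 50 * 1024**3  # ~50 GiB ceiling from the guidance above

# Hypothetical response from GET _cat/shards?format=json&bytes=b
sample = json.loads("""
[
  {"index": "logs-sample", "shard": "0", "prirep": "p", "state": "STARTED", "store": "64424509440"},
  {"index": "logs-sample", "shard": "0", "prirep": "r", "state": "STARTED", "store": "64424509440"},
  {"index": "logs-sample", "shard": "1", "prirep": "p", "state": "STARTED", "store": "32212254720"},
  {"index": "logs-sample", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "store": null}
]
""")


def oversized_primaries(shards, limit_bytes=MAX_SHARD_BYTES):
    """Return (index, shard, bytes) for primaries larger than the limit."""
    return [
        (s["index"], s["shard"], int(s["store"]))
        for s in shards
        # Unassigned shards report no store size, so skip empty values.
        if s["prirep"] == "p" and s["store"] and int(s["store"]) > limit_bytes
    ]


for index, shard, size in oversized_primaries(sample):
    print(f"{index} shard {shard}: {size / 1024**3:.0f} GiB exceeds the target")
```

A primary that shows up here is a candidate for reindexing with more primary shards (for example moving from 3:1 to 6:1).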