How do I troubleshoot long-running or stuck snapshots in OpenSearch Service?

I registered a snapshot repository in Amazon OpenSearch Service. When I try to take a manual snapshot, the snapshot is stuck in progress for a long time or it fails.

Resolution

The following errors occur when manual snapshots fail:

  • "snapshot_in_progress_exception"
  • "concurrent_snapshot_execution_exception"
  • "cannot snapshot while a snapshot deletion is in-progress"
  • "Unable to upload object [abcd/efgh/1234/ABCDEF] using multipart upload"

Large data size, busy clusters, limited resources, or network issues can cause long-running or failed snapshots. To resolve issues with manual snapshot performance, take the following actions.

Note: When a snapshot is in progress, you can still index documents and make other requests in the cluster. However, new documents and updates to existing documents aren't included in the pending snapshot.

Check for high CPU utilization or high JVM pressure

High CPU utilization or high JVM pressure can cause manual snapshots to fail. To track your CPU utilization and JVM pressure, use Amazon CloudWatch alarms for those metrics.
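As a rough sketch of such alarms, the following Python builds parameter sets that could be passed to the CloudWatch put_metric_alarm call in boto3. The domain name, account ID, thresholds, and periods are illustrative assumptions and don't come from this article, so adjust them to your workload. The block only builds dictionaries, so it runs without AWS credentials.

```python
def build_alarm(domain_name, account_id, metric_name,
                threshold, period_seconds, evaluation_periods):
    """Return keyword arguments for a CloudWatch alarm on an OpenSearch Service metric."""
    return {
        "AlarmName": f"{domain_name}-{metric_name}-high",
        "Namespace": "AWS/ES",  # namespace used by OpenSearch Service metrics
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


# Hypothetical domain and account; thresholds are example values only.
alarms = [
    build_alarm("my-domain", "123456789012", "CPUUtilization", 80.0, 900, 3),
    build_alarm("my-domain", "123456789012", "JVMMemoryPressure", 95.0, 60, 3),
]

for params in alarms:
    print(params["AlarmName"], params["Threshold"])
    # With boto3 installed and credentials configured, each alarm could be created with:
    # boto3.client("cloudwatch").put_metric_alarm(**params)
```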

Check your configuration

Make sure that you use an appropriate instance type for your dedicated master nodes based on the number of data nodes. Also choose an appropriate number of shards for your indexes. Each node must have fewer than 25 shards for each GiB of Java heap memory. In OpenSearch Service, heap memory equals half of the instance memory, up to a maximum of 32 GiB. For information about how much memory each instance type has, see On-Demand Instance pricing.
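The shard guideline can be worked out with a few lines of arithmetic. The following minimal sketch applies the two rules stated above (heap is half of instance memory, capped at 32, and 25 shards for each GiB of heap). The function names and the example memory sizes are hypothetical.

```python
MAX_HEAP_GIB = 32          # heap cap stated above
SHARDS_PER_HEAP_GIB = 25   # per-node shard guideline stated above


def heap_gib(instance_memory_gib):
    """Heap is half of instance memory, up to the cap."""
    return min(instance_memory_gib / 2, MAX_HEAP_GIB)


def shard_ceiling(instance_memory_gib):
    """Shard count that a node should stay below."""
    return int(SHARDS_PER_HEAP_GIB * heap_gib(instance_memory_gib))


def within_guideline(shards_on_node, instance_memory_gib):
    """True when the node holds fewer shards than the ceiling."""
    return shards_on_node < shard_ceiling(instance_memory_gib)


for memory in (16, 64, 128):  # example instance memory sizes in GiB
    print(memory, heap_gib(memory), shard_ceiling(memory))
```

For example, a node with 16 GiB of memory has an 8 GiB heap and should hold fewer than 200 shards. Nodes with 64 GiB or more all reach the 32 GiB heap cap, so their ceiling stays at 800 shards.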

Make sure that there's sufficient storage across the cluster nodes. Low disk space causes OpenSearch Service to unassign the shards and then rebalance them to new nodes. During this process, the cluster becomes unhealthy and snapshot operations can be delayed or stuck because OpenSearch Service considers snapshots to be low-priority tasks.
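One way to spot nodes that are running low on disk space is to inspect the output of GET _cat/allocation?format=json. This API isn't mentioned in this article and is used here as an assumption. The sketch below takes the already-parsed response rows. The 80 percent threshold and the sample rows are hypothetical.

```python
def nodes_low_on_disk(allocation_rows, max_used_percent=80):
    """Return (node, used_percent) pairs at or above the threshold."""
    flagged = []
    for row in allocation_rows:
        percent = row.get("disk.percent")
        if percent is None:
            # The UNASSIGNED row carries no disk figures.
            continue
        if int(percent) >= max_used_percent:
            flagged.append((row["node"], int(percent)))
    return flagged


# Hypothetical rows shaped like the _cat/allocation JSON output.
sample_rows = [
    {"node": "node-a", "shards": "120", "disk.percent": "62"},
    {"node": "node-b", "shards": "118", "disk.percent": "91"},
    {"node": "UNASSIGNED", "shards": "4", "disk.percent": None},
]

print(nodes_low_on_disk(sample_rows))  # [('node-b', 91)]
```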

To reduce network latency, make sure that the snapshot repository is in the same AWS Region as the OpenSearch Service cluster.

Schedule your snapshots sequentially and during periods of low traffic

To maintain consistency and integrity, OpenSearch Service processes snapshots one at a time. Schedule your snapshots so that each one finishes before the next one starts. Overlapping snapshot requests can fail or lead to incomplete or unreliable backups.
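A scheduler can check whether a snapshot is already running before it starts a new one. The following sketch assumes that you fetched and parsed the response of GET _snapshot/_status, which lists the snapshots that are currently running. The helper names and the sample responses are hypothetical.

```python
def running_snapshots(status_response):
    """List 'repository/snapshot' names found in a GET _snapshot/_status response."""
    return [
        f"{entry['repository']}/{entry['snapshot']}"
        for entry in status_response.get("snapshots", [])
    ]


def safe_to_start(status_response):
    """True when no snapshot is reported as running."""
    return not running_snapshots(status_response)


# Hypothetical responses: one idle cluster, one with a snapshot in progress.
idle = {"snapshots": []}
busy = {"snapshots": [{"repository": "my-repo", "snapshot": "hourly-01",
                       "state": "STARTED"}]}

print(safe_to_start(idle), safe_to_start(busy))
print(running_snapshots(busy))
```

If safe_to_start returns False, the scheduler can wait and retry later instead of sending a request that might be rejected.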

It's a best practice to take frequent snapshots. OpenSearch Service snapshots store only data that's changed since the last successful snapshot. The disk space needed for a week's worth of hourly snapshots is approximately equivalent to the disk space needed for a single end-of-week snapshot. However, hourly snapshots take less time to complete.

To reduce the load on the cluster, it's a best practice to take snapshots during periods of low traffic. Use the GET _cat/tasks API to list the progress of all tasks currently running on your cluster. For more information, see CAT tasks on the OpenSearch website.
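To pick snapshot activity out of the task list, you can filter the rows by action name. The following sketch assumes that you requested GET _cat/tasks?format=json and parsed the response. The sample rows are hypothetical, and the filter matches any action that contains the word "snapshot" rather than a fixed list of action names.

```python
def snapshot_tasks(cat_tasks_rows):
    """Keep only the task rows whose action name mentions snapshots."""
    return [row for row in cat_tasks_rows if "snapshot" in row.get("action", "")]


# Hypothetical rows shaped like the _cat/tasks JSON output.
sample_tasks = [
    {"action": "indices:data/write/bulk", "task_id": "n1:101",
     "running_time": "12ms", "node": "node-a"},
    {"action": "cluster:admin/snapshot/create", "task_id": "n1:102",
     "running_time": "14.2m", "node": "node-a"},
]

for task in snapshot_tasks(sample_tasks):
    print(task["task_id"], task["action"], task["running_time"])
```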

Monitor your snapshot progress

Use the snapshot status API (GET _snapshot/_status) to monitor the progress of your snapshots and identify any issues. For more information, see Get snapshot status on the OpenSearch website.
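The status response reports shard-level counters for each snapshot, which you can reduce to a short progress summary. The following sketch assumes an already-parsed status response. The function name and the sample values are hypothetical.

```python
def summarize_progress(status_response):
    """Return one summary per snapshot in a snapshot status response."""
    summaries = []
    for snap in status_response.get("snapshots", []):
        stats = snap["shards_stats"]
        total = stats["total"]
        percent = round(100 * stats["done"] / total, 1) if total else 0.0
        summaries.append({
            "snapshot": snap["snapshot"],
            "state": snap["state"],
            "percent_done": percent,
            "failed_shards": stats["failed"],
        })
    return summaries


# Hypothetical response for one snapshot that is a quarter of the way through.
sample_status = {"snapshots": [{
    "snapshot": "hourly-01",
    "repository": "my-repo",
    "state": "STARTED",
    "shards_stats": {"initializing": 0, "started": 30, "finalizing": 0,
                     "done": 10, "failed": 0, "total": 40},
}]}

print(summarize_progress(sample_status))
```

A percent_done value that doesn't change between checks, or a nonzero failed_shards count, is a sign that the snapshot needs attention.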

Manage your index lifecycle

To reduce snapshot size, regularly delete or archive old or irrelevant data. To manage the index lifecycle, use an Index State Management (ISM) policy.
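As an illustration, the following sketch builds a small ISM policy that deletes indexes after they reach a given age. The index pattern, the 30-day age, and the state names are assumptions chosen for the example. You can send the resulting JSON as the body of a PUT _plugins/_ism/policies/<policy_id> request.

```python
import json


def build_delete_policy(index_pattern, min_index_age):
    """Build an ISM policy body that deletes matching indexes past a given age."""
    return {
        "policy": {
            "description": f"Delete {index_pattern} indexes older than {min_index_age}",
            "default_state": "active",
            "states": [
                {
                    "name": "active",
                    "actions": [],
                    "transitions": [
                        {"state_name": "delete",
                         "conditions": {"min_index_age": min_index_age}}
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to new indexes that match the pattern.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


# Hypothetical pattern and retention period.
policy_body = build_delete_policy("logs-*", "30d")
print(json.dumps(policy_body, indent=2))
```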

Related information

How do I resolve the manual snapshot error in my OpenSearch Service cluster?

Why can't I delete an index or upgrade my OpenSearch Service cluster?

AWS Official, updated 1 year ago