I want to understand what causes Amazon Redshift to replace a node and how it affects the cluster.
Short description
Amazon Redshift cluster node replacements are maintenance operations that occur because of faulty hardware or hardware performance issues. Node replacements might temporarily interrupt database connectivity and affect operations. Amazon Redshift replaces nodes so that your data warehouse cluster remains healthy and reliable and continues to perform well.
Resolution
Hardware failures
Hardware failures, hardware performance issues, or the potential risk of failure can cause Amazon Redshift to replace nodes. Amazon Redshift continuously monitors the hardware components within each cluster node. When Amazon Redshift detects an issue with one of the components, it replaces the node.
Node replacements are a preventative measure to avoid data corruption that might occur when a hardware component fails. However, you might experience temporary performance issues while the cluster redistributes data across nodes after a node replacement.
Scheduled maintenance
To improve reliability and performance for your cluster, Amazon Redshift performs system updates, applies patches, and replaces hardware components during default maintenance windows.
Effect on cluster operations
When Amazon Redshift replaces a node, you might experience a temporary cluster node disruption that affects performance. If you receive the hardware-failure cluster status for a single-node cluster, then you can't replace the node. Instead, you must restore the cluster from a snapshot. For more information, see Amazon Redshift snapshots and backups.
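The single-node rule above can be expressed as a small check. The following is a minimal sketch, not an official procedure. It assumes a cluster description shaped like a dictionary with `ClusterStatus` and `NumberOfNodes` keys (similar to what a describe-clusters call returns). The function name and the returned action strings are hypothetical and exist only for illustration.

```python
def recommended_action(cluster):
    """Suggest a next step for a cluster description.

    `cluster` is assumed to be a dict with "ClusterStatus" and
    "NumberOfNodes" keys. The returned strings are illustrative labels,
    not Amazon Redshift API values.
    """
    status = cluster.get("ClusterStatus")
    node_count = cluster.get("NumberOfNodes", 0)

    # Only the hardware-failure status is handled in this sketch.
    if status != "hardware-failure":
        return "none"

    # A single-node cluster can't have its node replaced, so the
    # cluster must be restored from a snapshot.
    if node_count == 1:
        return "restore-from-snapshot"

    # For multi-node clusters, Amazon Redshift replaces the faulty node.
    return "wait-for-node-replacement"
```

For example, `recommended_action({"ClusterStatus": "hardware-failure", "NumberOfNodes": 1})` returns `"restore-from-snapshot"`.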
Node replacement best practices
To minimize the effect of node replacements on your database operations, use the following best practices:
- Review the Amazon Redshift event notifications that alert you to maintenance windows or hardware issues that might cause node replacements, and plan your operations accordingly.
- Defer your cluster's maintenance window so that maintenance runs during off-peak hours or periods of lower database activity.
- Build retry mechanisms into your applications. Retry mechanisms handle temporary connection losses or low performance during node replacements so that your applications can recover and operate after the node replacement is complete. All AWS SDKs have a built-in retry mechanism with an algorithm that uses exponential backoff.
- Use multi-node clusters for production workloads because multi-node clusters provide better fault tolerance and availability than single-node clusters.
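The retry practice in the preceding list can be sketched as follows. This is a minimal example of exponential backoff with jitter in plain Python, not the algorithm that the AWS SDKs use internally. `TransientConnectionError`, `run_with_retry`, and the default values for `max_attempts`, `base_delay`, and `max_delay` are assumptions for illustration. In a real application, catch the connection exception that your database driver raises instead.

```python
import random
import time


class TransientConnectionError(Exception):
    """Hypothetical stand-in for a driver's connection-lost error."""


def run_with_retry(operation, max_attempts=5, base_delay=1.0,
                   max_delay=30.0, sleep=time.sleep):
    """Call `operation` and retry it on transient connection errors.

    The wait before each retry is a random value between 0 and an
    upper bound that doubles after every failed attempt, capped at
    `max_delay` seconds. `sleep` is injectable to simplify testing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientConnectionError:
            # Give up and re-raise after the final attempt.
            if attempt == max_attempts:
                raise
            # Upper bound: base_delay, 2x, 4x, ... capped at max_delay.
            upper_bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper_bound))
```

To use the sketch, wrap the code that opens a connection and runs a query in a function, and then pass that function to `run_with_retry`. The random jitter spreads out reconnection attempts so that many clients don't retry at the same moment after the node replacement is complete.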
Related information
Clusters and nodes in Amazon Redshift
Amazon Redshift provisioned cluster event notifications
Considerations for using Amazon Redshift provisioned clusters