Why did my Amazon Redshift cluster reboot outside the maintenance window?


My Amazon Redshift cluster restarted outside the maintenance window.

Short description

An Amazon Redshift cluster restarts outside the maintenance window for the following reasons:

  • Amazon Redshift detected an issue with your cluster.
  • Amazon Redshift replaced a faulty node in the cluster.

To be notified about cluster reboots that happen outside your maintenance window, create an event notification subscription. When you specify the cluster as the source type, the subscription also notifies you about other events that affect that cluster. For more information, see Amazon Redshift cluster event notification subscriptions.

Resolution

Amazon Redshift detected an issue with your cluster

The following issues can initiate a cluster reboot.

An OOM error on the leader node

A query that runs on a cluster that you upgraded to a later version can cause an out-of-memory (OOM) exception. To resolve this issue, roll back the failed patch.

An OOM error that results from an earlier driver version

If you use an earlier driver version and your cluster reboots frequently, then download the latest Java Database Connectivity (JDBC) driver. It's a best practice to test the new driver version in your development environment before you use it in production.
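
Before you choose a driver release, you can confirm the version that your cluster currently runs. The following quick check isn't part of the driver upgrade itself; it returns a version string that includes the Amazon Redshift build number:

-- Return the cluster's version string, including the Redshift build
select version();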

Health check query failures

Amazon Redshift continually monitors the availability of its components. When a health check fails, Amazon Redshift restarts the cluster to return it to a healthy state as quickly as possible and minimize downtime.

Most health check failures happen when the cluster has long-running open transactions. When Amazon Redshift is cleaning up memory that's associated with long-running transactions, the cluster can lock up. To prevent this issue, it's a best practice to monitor unclosed connections and transactions.

To monitor long-running open connections, run the following example query:

select s.process as process_id,
       c.remotehost || ':' || c.remoteport as remote_address,
       s.user_name as username,
       s.db_name,
       s.starttime as session_start_time,
       i.starttime as start_query_time,
       datediff(s,i.starttime,getdate())%86400/3600 || ' hrs ' ||
       datediff(s,i.starttime,getdate())%3600/60 || ' mins ' ||
       datediff(s,i.starttime,getdate())%60 || ' secs ' as running_query_time,
       i.text as query
from stv_sessions s
left join pg_user u on u.usename = s.user_name
left join stl_connection_log c
          on c.pid = s.process
          and c.event = 'authenticated'
left join stv_inflight i
          on u.usesysid = i.userid
          and s.process = i.pid
where username <> 'rdsdb'
order by session_start_time desc;

To monitor long-running open transactions, run the following example query:

select *,
       datediff(s,txn_start,getdate())/86400 || ' days ' ||
       datediff(s,txn_start,getdate())%86400/3600 || ' hrs ' ||
       datediff(s,txn_start,getdate())%3600/60 || ' mins ' ||
       datediff(s,txn_start,getdate())%60 || ' secs' as txn_duration
from svv_transactions
where lockable_object_type = 'transactionid'
  and pid <> pg_backend_pid()
order by 3;

Then, run the following query to review the open transactions:

select * from svl_statementtext where xid = <xid> order by starttime, sequence;
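
In the output, the text column contains the SQL for each statement that ran in the transaction. Statements longer than 200 characters span multiple rows, which is why the query orders by sequence.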

To terminate idle sessions and free up the connections, use the PG_TERMINATE_BACKEND function.
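
For example, to end one of the sessions that the earlier monitoring query returned, pass its process ID to the function. The PID below is a placeholder; substitute the process_id value from your own query output:

-- End the session with the placeholder process ID 12345;
-- the function returns 1 if the session was terminated
select pg_terminate_backend(12345);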

Amazon Redshift replaced a faulty node in the cluster

Each Amazon Redshift node runs on a separate Amazon Elastic Compute Cloud (Amazon EC2) instance. A failed node is an instance that stops responding to the heartbeat signals that Amazon Redshift uses to periodically monitor the availability of compute nodes in your cluster.

When Amazon Redshift detects hardware issues or failures, it automatically replaces nodes during the next maintenance window. However, sometimes Amazon Redshift immediately replaces the faulty nodes so that your cluster can continue to perform correctly.

The following issues can cause Amazon Redshift to replace cluster nodes:

  • The EC2 instance doesn't respond because there's an underlying issue with the instance's hardware, or the automated health check fails.
  • There's an issue with a disk on the node. (To check for failed disks, see the example query after this list.)
  • An intermittent network issue or a problem with an underlying host can cause communication failures between nodes.
  • The discovery of a node or cluster times out.
  • An overloaded node causes OOM issues.
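
If you suspect a disk issue, you can check whether Amazon Redshift has marked any disk as failed by querying the STV_PARTITIONS system view. This is a supplementary check rather than part of the replacement process, and the view is visible only to superusers. An empty result means that no disk is currently flagged as failed:

-- List any disk partitions that are marked as failed
-- (owner identifies the node; failed = 1 indicates a failed disk)
select owner as node, host, diskno, used, capacity, failed
from stv_partitions
where failed = 1;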