Why do I have intermittent connectivity issues with my Amazon Redshift cluster?

7 minute read
0

I have intermittent connectivity issues when I try to connect to my Amazon Redshift cluster.

Short description

The following issues can cause intermittent connectivity issues in your Amazon Redshift cluster:

  • Restricted access for a particular IP address or CIDR block
  • Maintenance window updates
  • Node failures or scheduled administration tasks
  • Encryption key rotations
  • Too many active network connections
  • High CPU utilization of the leader node
  • Client-side connection issues

Resolution

Restricted access for a particular IP address or CIDR block

Check to see if there's restricted access for a particular IP address or CIDR block in your security group. Because of the DHCP configuration, your client IP address can change and cause connectivity issues. Also, if you don't use Elastic IP addresses for your Amazon Redshift cluster, then the AWS managed IP address of your cluster nodes might change. For example, your IP address can change when you delete your cluster and recreate it from a snapshot or when you resume a paused cluster.

Note: Public IP addresses are rotated when the Amazon Redshift cluster is deleted and recreated. Private IP addresses change whenever nodes are replaced.

To resolve any network restrictions, take one of the following actions:

  • If your application caches the public IP address behind a cluster endpoint, then use this endpoint for your Amazon Redshift connection. To keep your network connection stable and secure, don't use a DNS cache for your connection.
  • It's a best practice to use an Elastic IP address for your Amazon Redshift cluster. An Elastic IP address lets you change your underlying configuration and doesn't affect the IP address that clients use to connect to your cluster. This approach is helpful if you want to recover a cluster after a failure. For more information, see Managing clusters in a VPC.
  • If you use a private IP address to connect to a leader node or compute node, then use the new IP address in your settings. For example, if you performed SSH ingestion or have an Amazon EMR configuration that uses the compute node, then update your IP address. A new private IP address is granted to new nodes after a node replacement.

Maintenance window updates

Check the maintenance window for your Amazon Redshift cluster. During a maintenance window, your Amazon Redshift cluster can't process read or write operations. If a maintenance event is scheduled for a specific week, then the event starts during the assigned 30-minute maintenance window. When Amazon Redshift performs maintenance, any queries or other operations that are in progress are shut down. You can change the scheduled maintenance window from the Amazon Redshift console.

Node failures or scheduled administration tasks

In the Amazon Redshift console, check the Events tab for any node failures or scheduled administration tasks, such as a cluster resize or reboot.

If there's a hardware failure, then Amazon Redshift might be unavailable for a short period, and failed queries can occur. When a query fails, you see an Events description, such as the following one:

"A hardware issue was detected on Amazon Redshift cluster [cluster name]. A replacement request was initiated at [time]."

Or, if an account administrator scheduled a restart or resize operation on your Amazon Redshift cluster, then intermittent connectivity issues can occur. You see an Events description similar to the following one:

"Cluster [cluster name] began restart at [time].""Cluster [cluster name] completed restart at [time]."

For more information, see Amazon Redshift event categories and event messages.

Encryption key rotations

Check your Amazon Redshift cluster's key management settings to determine if you use AWS Key Management Service (AWS KMS) key encryption and key encryption rotation.

If your encryption key is turned on and the encryption key is being rotated, then your Amazon Redshift cluster is unavailable during this time. As a result, you receive the following error message:

"pg_query(): Query failed: SSL SYSCALL error: EOF detected"

The frequency of your key rotation depends on your environment's policies for data security and standards. Rotate the keys as often as needed or whenever the encrypted key might be compromised. Also, be sure to have a key management plan that supports both your security and cluster availability needs.

Too many active connections

In Amazon Redshift, all connections to your cluster are sent to the leader node, and there is a maximum limit for active connections. The node type determines the maximum quota that your Amazon Redshift cluster can support, not the node count.

When there are too many active connections in your Amazon Redshift cluster, you receive the following error message:

"[Amazon](500310) Invalid operation: connection limit "500" exceeded for non-bootstrap users"

If you receive an Invalid operation error when you connect to your Amazon Redshift cluster, then you reached the connection quota. To check the number of active connections for your cluster, review the DatabaseConnections metric in Amazon CloudWatch.

If you notice a spike in your database connections, then there might be a number of idle connections in your Amazon Redshift cluster. To check the number of idle connections, run the following SQL query:

select process, trim(a.user_name) as user_name, a.usesysid, a.starttime,
  datediff(s,a.starttime,sysdate) as session_dur, b.last_end,
 datediff(s,case when b.last_end is not null then b.last_end else
 a.starttime end,sysdate) idle_dur
     FROM
    (select starttime,process,u.usesysid,user_name
     from stv_sessions s, pg_user u
     where
     s.user_name = u.usename
      and u.usesysid>1
and process NOT IN (select pid from stv_inflight where userid>1
 union select pid from stv_recents where status != 'Done' and
  userid>1)
    ) a
     LEFT OUTER JOIN (select
 userid,pid,max(endtime) as last_end from svl_statementtext where
  userid>1 and sequence=0 group by 1,2) b ON a.usesysid = b.userid AND
 a.process = b.pid
    WHERE (b.last_end > a.starttime OR b.last_end is null)
    ORDER BY idle_dur;

The output looks similar to the following example:

 process | user_name  | usesysid |      starttime      | session_dur | last_end | idle_dur
---------+------------+----------+---------------------+-------------+----------+----------
   14684 | myuser     |      100 | 2020-06-04 07:02:36 |           6 |          |        6
(1 row)

When the idle connections are identified, use the following command syntax to shut down the connection:

select pg_terminate_backend(process);

The output looks similar to the following example:

pg_terminate_backend ----------------------
                    1
(1 row)

High CPU utilization of the leader node

All clients use a leader node to connect to an Amazon Redshift cluster. High CPU utilization of the leader node can result in intermittent connection issues.

If you try to connect to your Amazon Redshift cluster and the leader node is using high CPU, then you receive the following error message:

"Error setting/closing connection"

To confirm if your leader node reached high CPU utilization, check the CPUUtilization metric in Amazon CloudWatch. For more information, see Amazon Redshift metrics.

Client-side connection issues

Check for a connection issue between the client, such as Workbench/J or PostgreSQL, and your Amazon Redshift cluster. If your client tries to send a request from a port that was released, then a client-side connection reset might occur. As a result, the connection reset can cause intermittent connection issues.

To prevent these client-side connection issues, take one of the following actions:

  • Use the keepalive feature in Amazon Redshift to check that the connection between the client and server correctly operate. The keepalive feature also helps to prevent any connection links from being broken. To check or configure the values for keepalive, see Change TCP/IP timeout settings and Change DSN timeout settings.
  • If your queries appear to be running but stop responding in the SQL client tool, then check the maximum transition unit (MTU). The queries might fail to appear in Amazon Redshift because of a packet drop. A packet drop occurs when there are different MTU sizes in the network paths between two IP hosts. For more information about how to manage packet drop issues, see Queries appear to stop responding and sometimes fail to reach the cluster.
AWS OFFICIAL
AWS OFFICIALUpdated 21 days ago
3 Comments

I faced a similar issue with redshift cluster connection issue. This article really helped with troubleshooting and understanding the root cause. Referred to "Restricted access for a particular IP address or CIDR block" line by line and was able to troubleshoot the issue. Great Read :)

AWS
replied 2 months ago

select process, trim(a.user_name) as user_name, a.usesysid, a.starttime, datediff(s,a.starttime,sysdate) as session_dur, b.last_end, datediff(s,case when b.last_end is not null then b.last_end else a.starttime end,sysdate) idle_dur FROM (select starttime,process,u.usesysid,user_name from stv_sessions s, pg_user u where s.user_name = u.usename and u.usesysid>1 and process NOT IN (select pid from stv_inflight where userid>1 union select pid from stv_recents where status != 'Done' and userid>1) ) a LEFT OUTER JOIN (select userid,pid,max(endtime) as last_end from svl_statementtext where userid>1 and sequence=0 group by 1,2) b ON a.usesysid = b.userid AND a.process = b.pid WHERE (b.last_end > a.starttime OR b.last_end is null) ORDER BY idle_dur;

yuvasub
replied 24 days ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

profile pictureAWS
MODERATOR
replied 23 days ago