Skip to content

How do I troubleshoot communication issues in SQL Server failover clusters on Amazon EC2 instances?

10 minute read
0

I want to troubleshoot communication issues between nodes in my Windows Server Failover Cluster (WSFC) that runs Microsoft SQL Server on Amazon Elastic Compute Cloud (Amazon EC2) instances.

Short description

Communication issues occur in WSFCs on Amazon EC2 instances for the following reasons:

  • Network connectivity issues
  • Misconfigured elastic network interfaces
  • Incorrect IP address assignments
  • Missing security group rules
  • SQL Server connection management issues or transient network events

Resolution

Prerequisites:

  • Note the time that your issue started.
  • Determine whether the communication issue affected a single node or the entire cluster.
  • Note recent changes such as Windows updates, reboots, antivirus changes, or application deployments.

Review cluster event logs in Event Viewer

To identify the type of communication issue, complete the following steps:

  1. In the Windows search field, enter Event Viewer, and then open it.
  2. Choose the Windows Logs dropdown list, and then choose System.
  3. In the Actions navigation pane, choose Filter Current Log.
  4. For Event sources, enter FailoverClustering, and then choose OK.
  5. Look for the following critical event IDs:
    1135: Node removed from active membership because of network or communication issue.
    1127: Cluster network interface failed.
    1146: Resource control manager detected a resource failure.
    1230: Cluster network not operational.
    1130: Cluster network down. Can't communicate with other nodes.

If event IDs 1069, 1146, 1230, or 1135 appear, then proceed to the Generate and analyze cluster logs section of this article. For more information about event ID 1135, see Troubleshoot cluster issue with event ID 1135 on the Microsoft website.

Generate and analyze cluster logs

Cluster logs provide detailed information about cluster operations, failovers, and communication failures. By default, Microsoft stores logs in the C:\Windows\Cluster\Reports folder.

To generate and analyze your cluster logs, complete the following steps:

  1. Open Windows PowerShell as an administrator.

  2. To generate logs with local time stamps, run the following command:

    Get-ClusterLog -UseLocalTime
  3. (Optional) To view a specific time range within the logs, run the following command:

    Get-ClusterLog -TimeSpan minutes

    Note: Replace minutes with the number of minutes before the current time.

  4. (Optional) To increase log verbosity for detailed troubleshooting, run the following command:

    Set-ClusterLog -Level 5

    Important: To avoid performance issues, reset the log level to the default value of 3 after you troubleshoot your communication issue.

  5. To identify the failover type, review the generated cluster log file for the following keywords:
    MoveType::Manual shows a manual failover
    MoveType::Failover shows an automatic failover
    MoveType::Failback shows a failback operation

  6. Review your logs for network-related error messages that are similar to the following ones:
    "Channel to remote endpoint has failed"
    "Heartbeat counter/threshold" entries that show values that approach or exceed the threshold
    "Network connectivity failure" messages between specific nodes

Review cluster configuration using PowerShell

To check the cluster state and configuration, run the following commands in Windows PowerShell:

Get-Cluster | fl *        # Display cluster name and all properties
Get-ClusterNode           # List all node names and their status
Get-ClusterGroup          # Show clustered roles and their state
Get-ClusterNetwork        # Display network configuration
Get-ClusterQuorum         # Show quorum configuration
Get-ClusterResource       # List all cluster resources and their state

To check cluster heartbeat settings, run the following command:

Get-Cluster | fl *subnet*

Review the following parameters:

  • CrossSubnetDelay: The delay in milliseconds between heartbeat messages that are sent between nodes in different subnets. The default value is 1000.
  • CrossSubnetThreshold: The number of missed heartbeats before the cluster considers a node down in cross-subnet scenarios. The default value is 20.
  • SameSubnetDelay: The delay in milliseconds between heartbeat messages that are sent between nodes in the same subnet. The default value is 1000.
  • SameSubnetThreshold: The number of missed heartbeats before the cluster considers a node down in same-subnet scenarios. The default value is 20.

Check your network interface and IP address configuration

Make sure that you correctly configured your network so that each cluster node can communicate with other nodes and serve SQL Server traffic.

Verify that each node has one network interface with the following private IP addresses assigned:

  • A primary IP address to manage host network traffic
  • A cluster IP address for WSFC heartbeat communication
  • A SQL Server IP address to manage FCI listener traffic

To check your network interface and IP address configuration, complete the following steps:

  1. Open the Amazon EC2 console.
  2. In the navigation pane, choose Network Interfaces.
  3. Select each cluster node's network interface and confirm that you assigned the secondary private IP addresses.
  4. Confirm that you configured the operating system (OS) network interface controller (NIC) to use Dynamic Host Configuration Protocol (DHCP).
    Note: To prevent communication failures, don't assign static IP addresses at the OS level.

If you deployed nodes in different Availability Zones or subnets, then you must configure the cluster as a multi-subnet cluster. For more information, see Best practices and recommendations for SQL Server clustering on Amazon EC2.

Check your security group rules

Complete the following steps:

  1. Open the Amazon EC2 console.
  2. In the navigation pane, choose Security Groups.
  3. Select the security group that's attached to your cluster nodes.
  4. Verify that you configured the following inbound rules:
    TCP and UDP Port 3343
    TCP Port 1433
    TCP Port 5022
    TCP Port 445
    Internet Control Message Protocol for IPv4 (ICMPv4) for all ICMP types
  5. Verify that outbound rules allow all traffic or allow traffic to the same security group.

Troubleshoot the network connection between nodes

If cluster logs show network communication failures, then take the following actions to troubleshoot the issue.

Note: For the following commands, replace target-node-ip with your target node's IP address.

From one cluster node, run the following command to test connectivity to another node:

ping -t target-node-ip

Note: Monitor for packet loss or high latency.

To identify network hops and potential latency issues, run the following command:

tracert target-node-ip

To test cluster-specific ports, run the following commands:

Test-NetConnection -ComputerName target-node-ip -Port 3343
Test-NetConnection -ComputerName target-node-ip -Port 1433

Note: In the output, verify that TcpTestSucceeded returns true for each port.

Troubleshoot SQL Server connection management issues

Extended failover times can occur when SQL Server takes too long to stop during failover operations. SQL Server might take too long because it has several inactive or active connections that it must remove during shutdown.

To identify SQL Server connection issues, complete the following steps:

  1. Review the SQL Server ERRORLOG file at C:\Program Files\Microsoft SQL Server\MSSQL.\MSSQL\Log\ERRORLOG.

  2. Look for the following error patterns during failover events:
    Error 18056: Client can't reuse sessions, connection pooling reset issues
    Error 18401: Login failures with the "State 23 - Access to server validation failed when revalidating the login on the connection" error message
    Multiple authentication revalidation failures for pooled connections
    Sessions in flight during shutdown

  3. To check the number of active connections before the failover, run the following query:

    SELECT
        DB_NAME(dbid) as DatabaseName,
        COUNT(dbid) as NumberOfConnections,
        loginame as LoginName
    FROM sys.sysprocesses
    WHERE dbid > 0
    GROUP BY dbid, loginame
    ORDER BY NumberOfConnections DESC;
  4. To identify inactive connections that are idle for extended periods, run the following query:

    SELECT
        session_id,
        login_name,
        host_name,
        program_name,
        status,
        last_request_start_time,
        DATEDIFF(MINUTE, last_request_start_time, GETDATE()) as IdleMinutes
    FROM sys.dm_exec_sessions
    WHERE is_user_process = 1
        AND status = 'sleeping'
    ORDER BY IdleMinutes DESC;
    

Before you perform planned failovers, remove inactive SQL Server connections from applications.

At the application level, take the following actions:

  • Implement connection management logic in applications to close SQL Server connections when not in use.
  • Confirm that applications return connections to the connection pool.
  • Set connection timeout values in application connection strings.

For connection pool tuning, take the following actions:

  • Tune the connection pool size for concurrent connections in your application configuration.
  • Set the idle connection timeout to a lower value, such as 5-10 minutes, to prevent inactive connection accumulation.
  • Configure Max Pool Size and Min Pool Size to appropriate values in connection strings.

Then, perform a test failover to check whether failover performance improved.

Verify Amazon FSx for Windows File Server configuration

If your SQL Server cluster uses Amazon FSx for Windows File Server for shared storage, then check your Amazon FSx configuration.

Complete the following steps:

  1. Open the Amazon FSx console.
  2. In the navigation pane, choos File systems.
  3. Verify that all file systems show the Available status.
  4. To verify that you activated Continuous Availability (CA) on Amazon FSx shares, run the following Windows PowerShell command from a cluster node:
    Invoke-Command -ComputerName fsx-powershell-endpoint -ConfigurationName FSxRemoteAdmin -ScriptBlock {
        Get-FSxSmbShare -Name share-name
    }
    Note: Replace fsx-powershell-endpoint with your Amazon FSx file system's Windows PowerShell endpoint. Also, replace share-name with the name of your SMB share. In the command's output, confirm that ContinuouslyAvailable is set to true.
  5. Verify that applications and SQL Server use DNS names to connect to Amazon FSx. Failover and failback operations work only with DNS name connections.
  6. To verify that DNS returns an IP address for each file server node, run the following Windows PowerShell command:
    Resolve-DnsName fsx-dns-name
    Note: Replace fsx-dns-name with your Amazon FSx file system's DNS name.

For more information about Amazon FSx high availability, see Availability and durability: Single-AZ and Multi-AZ file systems.

Check for transient network issues

If multiple clusters across different AWS accounts or AWS Regions experience simultaneous failovers, then network connectivity might be transient.

To troubleshoot network issues, complete the following steps:

  1. Note the time range when failovers occurred across multiple clusters.
  2. For affected nodes, review the cluster logs for network-related errors during the time range that the failovers occurred.
  3. If you suspect an AWS infrastructure issue, then open a support case from each account where the affected resources exist.
    Note: In your support case, provide the timestamp of the failover (in UTC), affected Regions, and cluster logs from the time range that the failovers occurred. Also, use MTR or tracert to provide network trace results between affected nodes.

Further troubleshoot

Take the following actions:

  • Verify that the Elastic Network Adapter (ENA) driver is up to date on all cluster nodes. For more information, see Enable enhanced networking with the ENA on your EC2 instances. Also, see Troubleshoot the Elastic Network Adapter Windows driver.
  • Temporarily turn off antivirus software to determine whether your software is blocking required ports or processes. If you turn off your antivirus and the issues resolves, then configure exclusions for cluster-related processes and ports.
  • Check the Windows update history on all nodes to verify that all nodes have the same patch level.
  • For frequent transient network issues, modify cluster heartbeat thresholds.
    To modify heartbeat thresholds, run the following commands:
    (Get-Cluster).SameSubnetThreshold = 40
    (Get-Cluster).CrossSubnetThreshold = 40
    
    Note: Increased thresholds might cause a delay in failover detection. The default threshold is 20 missed heartbeats for same-subnet configurations and 20 for cross-subnet configurations. Modify the values based on your network stability and failover time requirements. For more information, see Tune the failover thresholds.

Related information

Best practices and recommendations for SQL Server clustering on Amazon EC2

What is FSx for Windows File Server?

AWS OFFICIALUpdated a month ago