Skip to content

SQL Server Instance Unresponsive

0

In the middle of the night, naturally, we ran into an incident where our main SQL Server seemingly became unreachable. There was a base level (below expected) number of connections that remained static and a likely higher level of sessions reported than usual.

During this time, because it was not in an 'Available' state this entire time we couldn't reboot. The AWS backup job also wouldn't let us cancel the job. Apps reported that they could not connect, I couldn't connect, and the error log was seemingly stuck. Cloudwatch has a gap where not even basic metrics were reported and Performance Insights follows that same gap.

We ended up updating our SG rules to completely block the instance, including the Windows Server Failover Cluster port. Shortly after AWS detected an issue with the instance and initiated a failover which largely resolved the issue.

Other than stumbling into that fuse, I don't know what else I was supposed to do to regain access to a critical instance besides wait for the backup to complete which, once it started again, took the better part of a business day.

Any ideas? Has anyone seen this with MS SQL Server before?

asked 2 years ago498 views
2 Answers
0

Check the maximum number of connections you system has and monitor resource throttling. Implementing MultiAZ will give you a hand (Costing) to take over any issues like this, if no underelying problems exist.

Look at this post also https://repost.aws/knowledge-center/rds-cannot-connect

EXPERT
answered 2 years ago
-1

1. Investigate Logs:

  • SQL Server Error Logs: Check the SQL Server error logs for any indications of what might have caused the unresponsiveness. Look for errors or unusual messages around the time the issue started.
  • Windows Event Logs: Review the Windows Event Logs for any system-level issues or errors that coincide with the SQL Server's unresponsive state.

2. Check for Blocking or Deadlocks:

  • High levels of blocking or a deadlock situation can cause SQL Server to become unresponsive. Use tools like SQL Server Management Studio (SSMS) or T-SQL scripts to investigate blocking and deadlocks.
  • You can run scripts like:
SELECT * FROM sys.dm_exec_requests WHERE blocking_session_id <> 0;

3. Resource Bottlenecks:

  • Check for resource bottlenecks such as CPU, memory, or I/O. However, as you mentioned there was a gap in CloudWatch metrics, this may indicate an issue at the host or AWS infrastructure level.

4. Cluster and Failover:

  • Review the configuration and health of the Windows Server Failover Cluster (WSFC) if you are using SQL Server Always On Availability Groups.
  • Ensure that the cluster nodes and network configuration are healthy and that the cluster service is operating correctly.

Preventive Measures

1. Monitoring and Alerts:

  • Enhance monitoring to get detailed insights before the instance becomes completely unresponsive. Use tools like AWS CloudWatch, SQL Server Profiler, or third-party monitoring tools (e.g., SolarWinds, Redgate) for proactive monitoring.
  • Set up alerts for unusual activities or resource consumption patterns that could indicate potential issues.

2. Backups and Maintenance:

  • Schedule backups during low-usage periods to avoid conflicts with peak operational times.
  • Regularly test backups to ensure they can be restored quickly in case of failure.

3. SQL Server Configuration:

  • Ensure SQL Server is properly configured for your workload. Review settings like max server memory, max degree of parallelism (MAXDOP), and tempdb configuration.
  • Consider using Resource Governor to manage workloads and prevent runaway queries from consuming all resources.

4. AWS RDS Recommendations:

  • If you are using AWS RDS for SQL Server, consider enabling Multi-AZ deployments for high availability and automated failovers.
  • Regularly review the recommendations provided by AWS Trusted Advisor and RDS Performance Insights.

Emergency Response Plan

1. Failover Mechanism:

  • Have a clear failover mechanism and procedures documented and tested. Ensure that your team knows when and how to initiate a manual failover if automatic failover mechanisms do not trigger.
  • In the case of unresponsiveness, knowing how to quickly trigger a failover can minimize downtime.

2. Service Interruption Protocol:

  • Have a protocol in place for handling service interruptions, including communication plans with stakeholders, escalation procedures, and a checklist for diagnosing issues.

3. Runbooks:

  • Maintain detailed runbooks that cover troubleshooting steps for common issues, including unresponsiveness, high resource utilization, and failover procedures.
EXPERT
answered 2 years ago
EXPERT
reviewed 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.