Endpoint using expiring certificate
Issue: Always On Availability Group endpoints can be configured to authenticate connections between replicas using either Windows authentication or certificate-based authentication. Certificate authentication is the standard approach in workgroup environments, in deployments that span Active Directory domain boundaries, or in any scenario where Windows authentication between the SQL Server service accounts is not practical.
When certificates are used, each replica has its own certificate that authenticates outbound connections, and it trusts the certificates of its partner replicas through copies imported into the master database. Like all SQL Server certificates, these certificates have an expiry_date property recorded in sys.certificates.
This finding identifies AG endpoints configured with certificate authentication where the certificate has an expiration date set, particularly those approaching expiration.
Problem: For endpoint authentication, SQL Server does check certificate expiration during the authentication handshake, so an expired certificate can prevent replicas from establishing connections to each other. The behavior is different from TDE certificate expiration (which only produces a warning), and it can cause genuine outages of AG synchronization once the certificate has expired.
The most likely operational impact is that AG synchronization stops working at the moment the certificate expires. Existing connections may continue for some time, but new connection attempts after expiration fail authentication, which can cascade into the AG appearing as not synchronizing, transaction log growth on the primary, and inability to fail over cleanly. The cause may not be obvious without specifically checking certificate expiration on each replica.
Certificate expiration is also frequently overlooked because endpoint certificates are not managed through the same processes as web server certificates or other application-level certificates. They live entirely inside SQL Server, and unless someone has deliberately added them to a tracking system, the expiration date can pass without anyone noticing until the AG breaks.
Some organizations also have policies that require certificate rotation on a defined schedule for security reasons, regardless of whether a specific service enforces expiration. Long-lived certificates increase the impact of a potential certificate compromise, so rotating them periodically is a reasonable security practice in addition to the operational requirement of replacing them before they expire.
Learn More...Listener offline
Issue: An Always On Availability Group listener is a virtual network name and IP address that clients use to connect to the AG without needing to know which replica is currently the primary. The listener is implemented as a Windows Server Failover Clustering resource that is brought online on the node hosting the current primary replica, allowing client connections to follow the primary through failovers automatically.
The listener has a state managed by the cluster, and under normal operation it should be online on the node hosting the primary replica. When the listener resource is offline on every node in the cluster, no client can reach the AG through the listener name, regardless of which replica is currently primary.
This finding identifies AGs where the listener is offline cluster-wide.
Problem: A listener that is offline on all nodes makes the AG inaccessible to any client connecting through the listener name. Applications that use the listener in their connection strings cannot reach the database, even if the underlying replicas are healthy and synchronizing normally. The AG itself may be working fine internally while clients see complete connection failure.
The condition often masks the actual extent of the problem. Reviews of replica state may show all replicas as healthy and synchronized, suggesting that the AG is operational, while applications and users experience the AG as completely down. Without checking the listener state specifically, the cause of the reported outage can be hard to pinpoint.
Listener offline conditions usually result from cluster-side issues, including the listener IP address being unavailable (DHCP problems, IP conflicts, network configuration errors), DNS registration failures preventing the listener name from resolving, cluster network resource problems, or a manual offline action that was not reversed. Each requires a different remediation, and identifying the cause typically requires looking at cluster events rather than SQL Server events.
The condition is also worth attention because it represents a scenario where the AG has lost a critical part of its high availability promise. The whole point of the listener is to abstract clients from the underlying replicas, and an offline listener defeats that abstraction. Even if a manual failover is performed, applications cannot benefit from it because they cannot reach the listener.
Learn More...Availability group backup preference
Issue: Always On Availability Groups have an AUTOMATED_BACKUP_PREFERENCE setting that indicates which replica should service automated backup jobs for databases in the group. The available values are PRIMARY, SECONDARY_ONLY, SECONDARY (prefer secondary), and NONE (any replica), and the default when the AG is created through the wizard is SECONDARY.
This finding identifies AGs configured with a backup preference other than PRIMARY. The setting itself does not enforce where backups run; backup jobs evaluate it through the function sys.fn_hadr_backup_is_preferred_replica, which they then honor to decide whether to execute or skip.
The preferred backup location is a meaningful design decision that affects backup completeness, recovery point objectives, and licensing.
Problem: Backups taken on a secondary replica have important limitations. Differential backups are not supported on secondaries, and full backups must be taken as COPY_ONLY, which means they cannot serve as the base for a differential restore chain. If your recovery strategy depends on differentials, the backups taken on a secondary replica do not support that strategy and a primary based full backup is still required.
Backups on a secondary also depend on the replica being healthy and current. If synchronization falls behind on the secondary running backups, the resulting backups represent older data than the primary, and your recovery point can silently drift further than expected without producing any failure alerts. The job appears successful, but it does not meet the actual recovery objective.
Licensing also factors in. Backups on a secondary require that secondary instance to be licensed, even when its sole role is to offload backups, so the cost benefit of offloading depends on the existing license posture. In environments where the secondary is licensed only for failover under software assurance, taking active backups there may exceed the passive use rights and create a compliance issue.
For these reasons, PRIMARY preference is the simplest, most predictable option. Backups always run on the primary regardless of failover, full and differential support is preserved, and licensing is straightforward since the primary is always actively licensed.
Learn More...Availability group session timeout
Issue: Each replica in an Always On Availability Group has a session timeout value, configured per replica through the SESSION_TIMEOUT option, which controls how long the replica waits for a response from another replica before declaring the connection lost. The default value is 10 seconds, set when the replica is created and rarely changed unless explicitly tuned.
When a replica does not receive a ping or other expected message from its partner within the session timeout window, SQL Server treats the connection as failed and may take action depending on the AG's failover mode and availability mode. In synchronous commit configurations with automatic failover, this can include initiating a failover.
This finding identifies replicas configured with a session timeout of 10 seconds or less, which is the default and lower than what we recommend for stable AG operation in most environments.
Problem: A 10 second session timeout is sensitive to brief network disruptions that are common in real environments. Network blips, switch reconfigurations, virtual machine vMotion or live migration events, and similar transient conditions can easily exceed 10 seconds without indicating any actual failure of the replica or the underlying host.
When the session timeout is exceeded, the AG may initiate an unnecessary failover, drop a secondary out of synchronization, or generate a flurry of error log entries that obscure other issues. In automatic failover configurations, this can produce an unplanned role change in response to a network event that would have resolved on its own within a few additional seconds.
The resulting failover then triggers application reconnections, listener target changes, and the operational overhead of returning the AG to its desired state, all of which is disruption that was not actually warranted by an underlying failure.
In our experience, raising the session timeout to 15 seconds or higher reduces the rate of false positive failures and unnecessary failovers without meaningfully delaying response to genuine outages. The tradeoff is small (a few extra seconds before reacting to a real failure) compared to the disruption avoided from transient network events.
Learn More...Availability group data movement
Issue: In an Always On Availability Group, data movement is the process by which transaction log records flow from the primary replica to each secondary replica, keeping the secondary databases synchronized with the primary. SQL Server tracks the data movement state for every database on every replica, and this state can be either resumed (the normal operating state) or suspended.
This finding identifies AG databases where data movement has been suspended on at least one replica, meaning the secondary copy is no longer receiving updates from the primary. The condition can result from a manual ALTER DATABASE ... SET HADR SUSPEND command, automatic suspension by SQL Server in response to specific errors, or operations such as restoring a backup of a different database with the same name on the secondary.
The result is that the affected secondary copy is increasingly out of sync with the primary for as long as the suspension persists.
Problem: A suspended secondary cannot be used for failover until data movement is resumed and the secondary catches up. If the primary fails while a secondary is suspended, automatic failover to that secondary is not possible, and a manual failover would result in significant data loss equal to whatever transactions occurred since the suspension began.
A suspended secondary also affects the transaction log on the primary. Log records that have not yet been delivered to all secondaries cannot be truncated, so the primary's log file can grow continuously while the suspension persists. If the suspension is not resolved promptly, the primary's log volume can fill, causing the entire AG to stop accepting writes.
Backup strategies that depend on secondary replicas are also disrupted. If the AG backup preference points to a secondary that is suspended, backup jobs may either skip entirely or run against an out of date copy of the database, producing recovery points that do not reflect actual current data.
Learn More...Replica not joined
Issue: When a database is added to an Always On Availability Group, it must be joined on each secondary replica before that replica can participate in synchronizing the database. The join step is a separate operation from creating the database copy on the secondary, and it tells SQL Server on the secondary to begin tracking the database as part of the AG and to start receiving log records from the primary.
A database can be in the AG configuration but not yet joined on a particular secondary, typically as a transient state during initial setup, after a manual operation that removed and re-added the database, or after a failure scenario that interrupted the join process. The unjoined database appears in the primary's AG configuration but is not actively synchronizing on the affected secondary.
This finding identifies AGs where one or more databases are not currently joined on at least one secondary replica.
Problem: A database that is not joined on a secondary replica is not protected by that replica. The database appears in the AG configuration, but the secondary holds no current copy of the data, so it cannot serve as a failover target for that specific database. If the primary fails, the database is unavailable until the join is completed on a remaining replica.
The condition also produces inconsistent AG behavior. Some databases in the AG fail over correctly, while the unjoined databases do not, leading to a situation where part of the workload survives the failover while another part is lost or unavailable. This inconsistency is often more confusing for application teams and end users than a clean total outage would be, since it produces partial application failures that are harder to diagnose.
Learn More...Availability group state change
Issue: Each replica in an Always On Availability Group has a current operational state, including values such as ONLINE, OFFLINE, PRIMARY, SECONDARY, RESOLVING, and various synchronization states. SQL Server records state transitions in the AlwaysOn_health Extended Events session and in the SQL Server error log, providing a history of when and how each replica's state has changed.
This finding identifies AGs where one or more replicas have undergone a state change within the last 24 hours. The check is based on entries in the AlwaysOn_health session and the error log that record role changes, synchronization state changes, and connectivity events.
A recent state change is not necessarily a problem, but it warrants review to confirm whether the change was expected or whether it indicates an underlying issue that needs attention.
Problem: A state change can result from several different events, each with different implications. A planned failover, performed for maintenance or to test recovery procedures, is a normal operation and the resulting state change is expected. An unplanned failover triggered by health detection or connectivity loss indicates that something disrupted the AG, which deserves investigation even if the AG recovered automatically.
Brief connectivity issues between replicas can produce state changes without an actual failover, particularly when session timeouts are short or when the network is intermittently unstable. Repeated state changes within a short window often indicate ongoing network instability, virtualization layer issues such as host live migration, or a partner replica that is struggling with its own resource pressure.
State changes also affect application connectivity. Applications connected through the AG listener may experience reconnection attempts, transaction rollbacks, or temporary failures during the change, even when the resulting state is healthy. Repeated state changes can cause user visible disruption that is not obvious from the AG dashboard once the AG has stabilized.
Learn More...Trace flag 1800
Issue: Trace flag 1800 is enabled globally.
Problem: This trace flag helps when disks of different sector sizes are used for primary and secondary replica log files. However, this trace flag is only required to be enabled on SQL Server instances with transaction log file residing on disk with sector size of 512 bytes, and isn't required to be enabled on disk with sector sizes larger than 4 KB.
Learn More...Availability group failover mode
Issue: Each replica in an Always On Availability Group is configured with a failover mode, set as either AUTOMATIC or MANUAL. Automatic failover allows SQL Server to fail the AG over to a secondary without administrator intervention when health conditions warrant, while manual failover requires a deliberate action by an administrator to initiate the role change.
Failover mode is closely tied to the availability mode of each replica. Automatic failover is only available between replicas configured with SYNCHRONOUS_COMMIT availability mode, since automatic failover requires that the secondary be guaranteed up to date. Asynchronous replicas always require manual failover.
This finding identifies AGs or instances where the failover modes are inconsistent, either with different replicas in the same AG using different failover modes, or with replicas hosted on the same instance configured differently across multiple AGs.
Problem: Inconsistent failover modes within an AG can produce unexpected behavior during a failure event. If the primary fails and the only synchronous replica is configured for manual failover while another replica is configured for automatic, the actual failover behavior depends on which secondary the cluster considers eligible at the moment of failure, which can lead to longer outages than the team expects.
Inconsistent modes across multiple AGs on the same instance create operational complexity that is easy to misjudge during incidents. An administrator responding to a failure may correctly invoke automatic failover behavior for one AG while an adjacent AG on the same host requires manual intervention, extending the overall recovery time and increasing the chance of mistakes under pressure.
The configuration can also indicate that the AG topology has changed from its original design. AGs are typically designed with a deliberate failover model (high availability with automatic failover, disaster recovery with manual failover, or a mix), and inconsistent settings often reflect changes made over time without revisiting the overall design.
In some cases the inconsistency is intentional, such as a multi-region AG where the local replicas are configured for automatic failover within the primary site while the remote disaster recovery replica is manual. The condition is worth reviewing in either case to confirm the configuration is deliberate and matches the documented availability strategy.
Learn More...Trace flag 9567
Issue: Trace flag 9567 is enabled globally.
Problem: This trace flag enables compression of the data stream for availability groups during automatic seeding. This compression can significantly reduce the transfer time during automatic seeding, but also increases the load on the processor.
Learn More...About sp_checks
This page contains a list of SQL Server configuration checks performed by Straight Path's suite of sp_check tools. For more details about our free tools, select one from the following list: