sp_check: SQL Server Database Checks

0 - Information Only

Database offline

Issue: A SQL Server database can be in one of several states, including ONLINE, OFFLINE, RESTORING, RECOVERING, SUSPECT, and EMERGENCY. The OFFLINE state means the database has been deliberately taken offline by an administrator using ALTER DATABASE [DatabaseName] SET OFFLINE; and is not accessible to any users or applications.

This finding flags databases currently in the OFFLINE state, which warrants review to confirm the offline status is intentional and still appropriate. Unlike SUSPECT or RECOVERY_PENDING, the OFFLINE state is the result of a manual action rather than a failure.

Problem: An offline database is completely unavailable, so any application, report, or process that depends on it will fail. While administrators sometimes take databases offline intentionally for maintenance, migration, or temporary isolation, an offline database that has been forgotten consumes storage and licensing without providing any value.

Forgotten offline databases also create operational risk. They are excluded from routine maintenance such as integrity checks, backups, and index maintenance, so when someone eventually brings them back online, they may have stale statistics, accumulated corruption that was never caught, or be missing transaction log data that prevents proper recovery.

Offline databases also clutter inventories and obscure the real state of the environment, making it harder to know which databases are truly needed and which can be retired.

Learn More...

Backup to NUL

Issue: NUL is a virtual device in Windows that discards anything written to it. When SQL Server is told to back up a transaction log to NUL using a command such as BACKUP LOG [DatabaseName] TO DISK = 'NUL';, the engine performs all the operations of a real log backup, including marking log records as backed up and allowing the log to be truncated, but the resulting backup file is silently thrown away.

This finding indicates that one or more databases have had a recent transaction log backup directed to the NUL device. Microsoft explicitly documents that NUL should not be used in production environments and is intended only for testing backup performance.

Problem: A log backup to NUL breaks the recovery chain. SQL Server records the backup as having occurred and advances its internal log sequence numbers, but the actual data needed for recovery is gone, meaning point in time restore is no longer possible from any prior log backup forward.

This is particularly dangerous because the database appears healthy from a backup history perspective. Reports and monitoring tools see recent log backups in msdb.dbo.backupset and assume the database is protected, when in reality there is nothing to restore. The damage is silent until a recovery is actually attempted, at which point the gap in the log chain becomes a disaster.

NUL backups are also commonly seen as a workaround for log files that grew out of control, used to clear the log without taking the time or storage to capture a real backup. While this fixes the immediate space issue, it leaves the database without a valid recovery path until a new full backup is taken.

Learn More...

Virtual log files

Issue: SQL Server divides every transaction log file into smaller internal segments called Virtual Log Files (VLFs). The number and size of VLFs created during a log file growth depends on the size of the growth increment: small growths produce many small VLFs, while large growths produce fewer, larger VLFs.

When a log file has been grown many times in small increments, often through default autogrowth settings or after frequent shrink and grow cycles, the total VLF count can climb into the thousands. This finding flags databases where the number of VLFs in the transaction log exceeds 200, which is our recommended threshold for investigation.

Problem: A high VLF count adds overhead to nearly every operation that touches the transaction log. Log backups slow down because SQL Server must process each VLF individually, transaction log replay during recovery takes longer (extending crash recovery and Always On failover times), and database startup itself can become noticeably slower.

The performance impact compounds during disaster recovery. A database that takes seconds to come online with a properly sized log can take many minutes or longer with thousands of VLFs, directly extending downtime when it matters most.

Excessive VLFs also indicate poor log management practices, typically a combination of small autogrowth increments, undersized initial log files, and possibly repeated shrink operations. The VLF count itself is a symptom, but the underlying configuration usually causes other related problems including log growth pauses and unpredictable performance.

Learn More...

Backup history not purged

Issue: SQL Server records every backup and restore operation in a set of history tables in the msdb system database, including backupset, backupfile, backupmediafamily, backupmediaset, restorehistory, and related tables. By default, this history accumulates indefinitely, meaning every full, differential, and log backup for every database stays in msdb forever unless explicitly purged.

This finding indicates that backup history in msdb has never been cleaned up or contains rows dating back years. On busy instances with frequent log backups, this history can grow into millions of rows over time.

Problem: A bloated msdb database causes several practical problems. The database itself grows large, consuming storage and increasing backup size and duration for msdb. Queries against backup history slow down significantly, which affects monitoring tools, backup verification scripts, and the SSMS Restore Database dialog (which can take many minutes to open on instances with severe history bloat).

Maintenance operations on msdb such as integrity checks and index maintenance also take longer, and the size of msdb can complicate disaster recovery procedures since msdb must be restored or rebuilt as part of full instance recovery.

Beyond performance, retaining years of backup history provides little operational value. Backup history older than your retention period cannot be used for restores anyway because the underlying backup files are long gone, so the rows simply consume space without serving any purpose.

Learn More...

Tempdb encrypted

Issue: SQL Server's Transparent Data Encryption (TDE) protects user databases by encrypting data at rest, including data files, log files, and backups. When TDE is enabled on any single user database on an instance, SQL Server automatically encrypts tempdb as well, since user data may flow through tempdb during query execution and would otherwise be exposed there.

This finding identifies instances where tempdb is encrypted, which is always the result of TDE being enabled on at least one user database. The encryption of tempdb itself is not configurable independently and cannot be turned off while TDE is active on any database on the instance.

This is not necessarily a problem, but it is worth noting in case the team is unaware that encrypted user databases exist on the instance, or unaware of the performance implications that come with tempdb encryption.

Problem: Encrypting tempdb has a measurable performance cost. Every page written to or read from tempdb must be encrypted and decrypted, which adds CPU overhead to operations that use tempdb heavily, including sorts, hash joins, spills, temporary tables, table variables, and version store activity for snapshot isolation or read committed snapshot. The performance impact is most noticeable on workloads with high tempdb activity. In some cases, the additional CPU cost can be significant enough to justify reviewing query patterns or workload distribution, particularly if tempdb performance was already a bottleneck.

There is also a subtle operational consideration: once tempdb is encrypted, it stays encrypted until TDE is removed from every user database on the instance. Disabling TDE on a single database does not revert tempdb to an unencrypted state, so the decision to use TDE has lasting effects on the entire instance.

Also, the presence of TDE means certificate management is critical. The TDE certificate from the master database must be backed up and stored securely, since the loss of that certificate makes every encrypted database (and any backups of those databases) permanently unrecoverable.

Learn More...

Tempdb file with max size set

Issue: Every SQL Server database file has a maximum size setting that controls how large the file can grow. By default, this is set to UNLIMITED, allowing the file to grow until it fills the underlying volume. When a maximum size is configured, SQL Server stops growing the file once it reaches that limit and returns errors on operations that need additional space.

This finding indicates that one or more tempdb data or log files have a maximum size value set rather than UNLIMITED. The setting is sometimes applied as a safety measure to prevent tempdb from filling a drive, but it introduces operational risk on a database that is critical to the entire instance.

Problem: tempdb is shared by every database and every session on the SQL Server instance and is used for sorts, hash joins, spills, temporary tables, table variables, snapshot version store, and many internal operations. When tempdb runs out of space, queries fail with error 1105, sessions are disrupted, and in severe cases the entire instance can become unstable until space is freed.

Setting a maximum size on tempdb files creates an artificial ceiling that triggers these failures even when free space is still available on the drive. A single large query, an unexpected sort spill, or a long running transaction holding the version store open can hit the cap and cause widespread query failures across all databases on the instance.

The intended safety benefit (preventing tempdb from filling the drive) is better achieved through proper drive sizing and monitoring than through a hard size limit. A capped tempdb shifts the failure from a drive level event to a query level event, but the failure still occurs and is often more disruptive because it surfaces as application errors during normal operation.

Learn More...

Health check timeout

Issue: The Health Check Timeout is a Windows Server Failover Clustering (WSFC) setting that controls how long the cluster waits for a response from the SQL Server resource DLL during health checks before considering the resource unhealthy. The cluster periodically calls into SQL Server to confirm the instance is responsive, and if SQL Server does not respond within the timeout, the cluster may take action including failover.

The default value is 30000 milliseconds (30 seconds), which sounds generous but can be insufficient on busy instances or during transient resource pressure. The setting is configured on the SQL Server cluster resource and applies to the health check mechanism for both Failover Cluster Instances and Always On Availability Group resources.

This finding identifies clusters where the Health Check Timeout is at or below the default of 30000 milliseconds.

Problem: A 30-second timeout can produce false-positive health check failures on instances under load. When SQL Server is busy with high CPU, memory pressure, or a long-running operation that briefly delays the resource DLL response, the cluster may interpret the delay as a genuine failure and initiate corrective action. The result is an unplanned failover that was not actually warranted by an underlying problem.

In our experience, the default value is low enough that brief, recoverable conditions can trigger unnecessary failovers, particularly on heavily-utilized instances or during specific operations such as large transaction commits, statistics updates, or extensive blocking chains. Each unnecessary failover causes application disruption, reconnection events, and the operational overhead of returning the cluster to its desired state, all without addressing any real problem.

Raising the Health Check Timeout to 45000 milliseconds or higher gives SQL Server a more realistic window to respond to health checks during transient pressure. The tradeoff is small (the cluster takes a few extra seconds to detect a genuine failure) compared to the disruption avoided from false-positive failovers. This is similar in spirit to the Availability Group session timeout finding, where modest increases over the default produce significant stability gains.

The condition often persists because the default is rarely reviewed during initial deployment. Clusters are configured with whatever the installer or template provides, and the setting only surfaces when unexplained failovers prompt deeper investigation. By that time, the team has often had multiple disruptive events that could have been avoided with a tuning adjustment.

Learn More...

Secondary lag

Issue: Always On Availability Groups maintain synchronization between replicas by streaming transaction log records from the primary to each secondary, where the secondary applies the records to its copy of the database through a process called redo. The time between when a transaction commits on the primary and when it is applied on the secondary represents the lag, with separate measurements available for log send latency (primary to secondary network) and redo latency (the secondary catching up its data files).

This finding identifies AGs where one or more replicas are exhibiting estimated secondary lag exceeding 5 seconds. The condition can apply to either synchronous-commit or asynchronous-commit replicas, with different implications depending on the configuration.

Problem: For asynchronous-commit replicas, lag is expected and is part of the design tradeoff. The asynchronous mode trades up-to-date secondaries for reduced impact on the primary, with the understanding that some lag is normal. However, sustained lag well beyond what the team has documented as acceptable may indicate problems that should be investigated, since significant lag at the moment of a primary failure translates directly into data loss equal to the lag duration.

For synchronous-commit replicas, lag indicates that the secondary is having trouble keeping up with the primary, which can begin to affect performance on the primary itself. Synchronous mode requires the primary to wait for the secondary to harden each transaction's log records, so a slow secondary directly slows down primary commit times. Sustained lag in synchronous mode is also one of the conditions that can drive the AG out of SYNCHRONIZED state, eliminating the secondary as a clean failover target.

Lag also affects readable secondary scenarios. If applications use secondary replicas for read-only queries, the data they see may be substantially behind the primary, which can produce stale reads, inconsistent reporting, and confusion when results from the primary and the secondary disagree. Reporting that depends on near-current data is particularly sensitive to extended lag.

The condition can also point to broader infrastructure issues. Network bandwidth or latency problems between replicas, slow storage on the secondary, resource pressure (CPU, memory) that is starving the redo process, or schema patterns that produce inefficient redo (heavy DDL, large transactions, parallel index builds) can all produce sustained lag. Addressing the root cause often improves performance for the whole AG, not just the lagging replica.

Learn More...

HADR_SYNC_COMMIT waits

Issue: HADR_SYNC_COMMIT is a SQL Server wait type that occurs on the primary replica of an Always On Availability Group when committing transactions on databases configured for synchronous-commit availability mode. The primary waits on this type while it sends the transaction's log records to each synchronous secondary and waits for confirmation that the log has been hardened to disk on those secondaries.

The wait time directly reflects the round-trip cost of synchronous replication, including network latency between replicas, log write performance on the secondaries, and any time spent in queues at either end. The wait is a normal and expected part of synchronous-commit operation, but the average duration is a useful indicator of how much synchronous replication is costing the workload.

This finding flags primaries where the average HADR_SYNC_COMMIT wait time has exceeded 5 milliseconds, which is a commonly accepted threshold for investigation.

Problem: Every committed transaction on a synchronous-commit database pays the HADR_SYNC_COMMIT cost. When the average wait climbs above a few milliseconds, every commit takes longer, which translates directly into increased response time for write-heavy workloads and reduced throughput for transactional operations. Applications that perform many small commits feel this most acutely.

Sustained high waits can indicate problems with the replication path or with the secondaries themselves. Network latency between replicas (often introduced when replicas live in different datacenters or cloud regions), slow storage on the secondaries causing delayed log hardening, or resource pressure on the secondary instances are all common causes. Each requires a different remediation, and the wait time alone does not identify which is at fault.

The condition can also indicate a topology mismatch. Synchronous-commit replicas should generally be physically and logically close to the primary so the round-trip is fast. Replicas located far away or on slower infrastructure are typically better configured for asynchronous-commit, accepting some recovery point objective tradeoff in exchange for not paying the latency cost on every commit.

Persistent elevated HADR_SYNC_COMMIT waits can also affect other parts of the system. Long commit waits hold transactional resources longer, which can cascade into blocking, increased lock contention, and elevated waits on other types. A small synchronization problem can produce broad performance symptoms that are hard to trace without specifically looking at AG-related waits.

Learn More...

Replica not failover ready

Issue: Each replica in an Always On Availability Group has a synchronization state that reflects how current it is relative to the primary. The states include SYNCHRONIZED (the replica is fully caught up and committed transactions have been hardened on it), SYNCHRONIZING (the replica is processing log records but is not guaranteed to be fully current), NOT SYNCHRONIZING (the replica is not receiving updates), and REVERTING or INITIALIZING during specific transient operations.

For a synchronous-commit replica to be eligible for automatic failover or for a manual failover without data loss, it must be in the SYNCHRONIZED state. Replicas in any other state are not considered ready for failover, even if they appear to be receiving updates and processing them normally.

This finding identifies AGs with failover mode AUTOMATIC have one or more replicas are in a state other than SYNCHRONIZED, indicating that the AG cannot fail over cleanly to those replicas at the current moment.

Problem: A replica that is not synchronized cannot serve as a clean failover target. If the primary fails while a synchronous-commit replica is in SYNCHRONIZING state, the cluster may not initiate automatic failover to that replica, and a manual failover would either fail or proceed only with the data loss option, accepting whatever transactions had not yet been committed on the secondary.

For asynchronous-commit replicas, the synchronization state is expected to lag behind the primary, but the replica should still be SYNCHRONIZING rather than stuck in another state. A replica that is NOT SYNCHRONIZING is genuinely disconnected from the primary's data flow and is becoming further out of date with every transaction that commits on the primary.

The condition also indicates that the high availability promise of the AG is not currently being met. The whole point of having multiple replicas is to provide failover targets, and a replica that is not in a failover-ready state is not contributing to availability. If the primary fails during this window, the recovery experience is significantly worse than the AG's design implied.

Learn More...

About sp_checks

This page contains a list of SQL Server configuration checks performed by Straight Path's suite of sp_check tools. For more details about our free tools, select one from the following list: