There are a ton of great resources out there teaching you how to build a Distributed Availability Group (DAG), but very few that help you troubleshoot one when it goes bump. Fortunately for you (and unfortunately for me), I’ve had enough experience administering DAGs that I’d like to start sharing some of that troubleshooting know-how.
We had an interesting case recently involving a client who uses a DAG for Disaster Recovery (DR). After they had done Windows patching on all of the replicas in the DAG, the servers in the Global Primary AG stopped communicating with each other for no apparent reason. After attempting some less-than-ideal troubleshooting approaches themselves, they looped us in and gave us a call.
To give some background on this topology, they have a DAG comprised of two individual AGs. Their Global Primary AG (we’ll call this AG1) has three replicas, while their Forwarder AG (we’ll call this AG2) has two. The replicas in AG1 are all Azure VMs in the same subnet, and the replicas in AG2 are all Azure VMs in their own subnet.
By the time we got logged in and connected, the Global Primary Replica was online and able to receive connections. The secondary replicas in the Global Primary AG, however, were unable to communicate with the Global Primary Replica. This is problem 1. The second problem was that several databases on the Forwarder Primary Replica were stuck in crash recovery. This is problem 2. Since problem 2 was happening in the DR environment, it was set aside for the time being.
Troubleshooting Steps
When looking at the SQL Server Logs on both of the secondary replicas in the Global Primary AG (AG1), we saw this message:
A connection timeout has occurred while attempting to establish a connection to availability replica ‘XXX’ with id [XXX]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.
Given the wording of the message, here are the following items that were checked and verified on all replicas in AG1:
- The HADR endpoint was owned by sa.
- The account running SQL Server had CONNECT permissions on the HADR endpoint.
- The HADR endpoint was in a STARTED state.
- Port 5022 was open on all replicas.
- Test-NetConnection on port 5022 succeeded both to and from all replicas in AG1.
- The AG had the CREATE ANY DATABASE permission (not really applicable to this type of issue, but checked nonetheless).
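For reference, most of those checks can be scripted rather than clicked through. Here’s a sketch of how, run on each replica; the endpoint name and port (5022) are whatever your environment uses:

```sql
-- Endpoint state, owner, port, and authentication method in one pass
SELECT e.name,
       e.state_desc,                       -- should be STARTED
       SUSER_NAME(e.principal_id) AS endpoint_owner,
       te.port,                            -- 5022 in this environment
       e.connection_auth_desc
FROM sys.database_mirroring_endpoints AS e
JOIN sys.tcp_endpoints AS te
    ON e.endpoint_id = te.endpoint_id;

-- Who has been granted CONNECT on an endpoint
SELECT pr.name AS grantee,
       pe.permission_name,
       pe.state_desc
FROM sys.server_permissions AS pe
JOIN sys.server_principals AS pr
    ON pe.grantee_principal_id = pr.principal_id
WHERE pe.class_desc = 'ENDPOINT'
  AND pe.permission_name = 'CONNECT';
```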
All of the above checked out, which left us scratching our heads. Since every sign was pointing to the network at this point, we decided to pull the two secondary replicas from AG1 but keep them in the cluster, to prevent log reuse issues on the primary while the problem was being worked on.
The next day, some of the client’s network engineers joined the call and validated that everything looked good from their end and nothing had changed. In the midst of our troubleshooting, Mike Walsh suggested checking how endpoint authentication was configured.
Small backstory break here – when this client was migrating from on-prem to Azure, they involved a third-party firm to help them get to the cloud (this didn’t involve much of our input at the time, parent companies and all..). They used the DAG to perform the migration, because they were not only going from an on-prem AG to the cloud (which can be done without a DAG anyway) but also moving to a different domain. So, for the on-prem AG to communicate with the new AG in Azure, they had to use certificate-based authentication for endpoint communication. Standard protocol for something like that. Anyways, back to the show…
Checking the endpoint authentication method with the query below:
SELECT name, state_desc, protocol_desc, connection_auth_desc
FROM sys.database_mirroring_endpoints;
we found that the connection_auth_desc column returned CERTIFICATE, NEGOTIATE. Remembering the backstory above, I checked the certificate status on each of the replicas in AG1. It turned out that the certificate used for endpoint communication during the initial migration had expired just a few days earlier, and when SQL Server was restarted during patching, it saw the expired certificate and said “Yeah, this is expired, I’m not talking to any of these replicas anymore.” Even though the HADR endpoint listed both authentication methods, the expired certificate was enough to break endpoint communication.
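Checking for this condition can be done in one query. Endpoint certificates live in master, so this is a sketch to run there on each replica:

```sql
-- Which certificate (if any) backs the mirroring endpoint, and whether it has expired.
-- Run in the master database, where endpoint certificates are stored.
SELECT e.name AS endpoint_name,
       e.connection_auth_desc,
       c.name AS certificate_name,
       c.expiry_date,
       CASE WHEN c.expiry_date < GETDATE() THEN 'EXPIRED' ELSE 'OK' END AS cert_status
FROM sys.database_mirroring_endpoints AS e
LEFT JOIN sys.certificates AS c
    ON e.certificate_id = c.certificate_id;
```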
Solution
The solution here was to alter the endpoint to use Windows Authentication (NEGOTIATE) only, so it no longer relies on certificate-based authentication. Here’s the query used to fix that (run on all replicas in AG1):
ALTER ENDPOINT [Hadr_endpoint]
FOR DATABASE_MIRRORING
(
AUTHENTICATION = NEGOTIATE
);
GO
Then, we had to stop and start the HADR endpoint (again, run on all replicas in AG1):
ALTER ENDPOINT [Hadr_endpoint] STATE = STOPPED;
GO
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;
GO
When I added the replicas back to AG1, they were communicating as expected again. Since I had removed the replicas when the issue started, I used automatic seeding to get the databases back in sync in AG1.
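If you want to watch automatic seeding rather than trust it, a sketch like this (run on the AG primary) shows each seeding attempt and why a stalled one is stuck:

```sql
-- Automatic seeding history and status; failure_state_desc explains stalls
SELECT ag.name AS ag_name,
       adc.database_name,
       s.current_state,
       s.failure_state_desc,
       s.start_time,
       s.completion_time
FROM sys.dm_hadr_automatic_seeding AS s
JOIN sys.availability_groups AS ag
    ON s.ag_id = ag.group_id
JOIN sys.availability_databases_cluster AS adc
    ON s.ag_db_id = adc.group_database_id
ORDER BY s.start_time DESC;
```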
Now there was a second problem to address. On the Forwarder AG (AG2), several databases were in crash recovery, and the undo queue was quite high. The hope was that they would work through crash recovery and resolve on their own. Most of the reading you’ll do online will tell you that if a database is in crash recovery, the only thing you can do is wait. We did wait, for about three days, but progress was abysmal, and for some reason it would almost reset each night.
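To quantify “abysmal” rather than guess, you can watch recovery itself: database startup recovery runs as a `DB STARTUP` request, which exposes a percent complete and an estimate. A minimal sketch:

```sql
-- Progress of crash recovery for databases currently recovering
SELECT r.session_id,
       DB_NAME(r.database_id) AS database_name,
       r.command,
       r.percent_complete,
       r.estimated_completion_time / 60000.0 AS est_minutes_remaining,
       r.wait_type
FROM sys.dm_exec_requests AS r
WHERE r.command = 'DB STARTUP';
```

Sampling this periodically is how you catch a recovery whose progress keeps resetting, as ours did.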
I tried to be as minimally invasive as possible by turning off HADR on the Forwarder Primary Replica for the impacted databases, but because the databases were in the DAG and still going through crash recovery, I was blocked by system SPIDs that I could not kill. I even tried rebooting both replicas in AG2 in hopes that parallel recovery would speed things up. It did not. The only remaining option was to tear down and rebuild the DAG. Here are the steps I took to get that done:
- Run the following command on the Global Primary Replica
DROP AVAILABILITY GROUP [DistributedAGNameHere];
- Run the following command on the Forwarder Primary Replica
DROP AVAILABILITY GROUP [DistributedAGNameHere];
- When I ran this on the Forwarder Primary Replica, the command hung with the wait type PWAIT_HADR_WORKITEM_COMPLETED. According to SQLSkills, this is a bug and you need to restart SQL Server. I attempted to restart just the service, but that hung as well, so I rebooted the OS, and that took. I then refreshed the Global Primary Replica and Forwarder Primary Replica servers and confirmed the DAG was no longer there.
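Before resorting to a restart, it’s worth confirming the DROP really is stuck on that wait rather than blocked by something killable. A quick check:

```sql
-- Find requests stuck on the DAG teardown wait type
SELECT r.session_id,
       r.command,
       r.status,
       r.wait_type,
       r.wait_time   -- milliseconds spent in the current wait
FROM sys.dm_exec_requests AS r
WHERE r.wait_type = 'PWAIT_HADR_WORKITEM_COMPLETED';
```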
- I then gathered the list of databases in the AG from the Global Primary Replica and used dynamic SQL to generate the code to remove the databases from the AG on the Forwarder Primary Replica and the Forwarder Secondary Replica.
SELECT
    ag.name AS availability_group_name,
    adc.database_name,
    'DROP DATABASE [' + adc.database_name + '];' AS drop_syntax,
    'ALTER AVAILABILITY GROUP [AG2] REMOVE DATABASE [' + adc.database_name + '];' AS remove_from_ag_syntax
FROM sys.availability_groups AS ag
JOIN sys.availability_databases_cluster AS adc
    ON ag.group_id = adc.group_id
WHERE ag.name = N'AG1' -- only the local AG's databases, not entries for the DAG itself
ORDER BY
    ag.name,
    adc.database_name;
- Using the generated SQL, I ran the “remove from AG” and “drop database” statements to remove the databases from both replicas in AG2 so they could re-seed from the DAG after the rebuild.
- After the databases were dropped on the two Forwarder replicas (DB04 and DB05), I rebuilt the DAG using the script below.
--RUN IN SQLCMD MODE
--Create Distributed Availability Group
--Run this on the primary replica of the primary Availability Group
:CONNECT GlobalPrimaryReplicaHere
CREATE AVAILABILITY GROUP [DistributedAGNameHere]
WITH (DISTRIBUTED)
AVAILABILITY GROUP ON
'AG1' WITH
(
LISTENER_URL = 'TCP://ListenerNameOfAG1Here.domain.com:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
),
'AG2' WITH
(
LISTENER_URL = 'TCP://ListenerNameOfAG2Here.domain.com:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
);
GO
--Create second availability group on second failover cluster with replicas and listener
--Run this on the primary replica of the secondary Availability Group
:CONNECT ForwarderPrimaryReplicaHere
ALTER AVAILABILITY GROUP [DistributedAGNameHere]
JOIN
AVAILABILITY GROUP ON
'AG1' WITH
(
LISTENER_URL = 'TCP://ListenerNameOfAG1Here.domain.com:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
),
'AG2' WITH
(
LISTENER_URL = 'TCP://ListenerNameOfAG2Here.domain.com:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
);
GO
Once I did this, the DAG was rebuilt, and auto-seeding took care of the rest. Things were back to normal and the client was happy.
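As a final sanity check after a rebuild like this, I like to verify the DAG from the Global Primary rather than eyeball SSMS. A sketch (for a distributed AG, replica_server_name holds the member AG names):

```sql
-- Confirm the DAG exists and its member AGs/databases are healthy
SELECT ag.name AS ag_name,
       ag.is_distributed,
       ar.replica_server_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
    ON ag.group_id = ar.group_id
LEFT JOIN sys.dm_hadr_database_replica_states AS drs
    ON ar.replica_id = drs.replica_id
WHERE ag.is_distributed = 1;
```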