SQL Server DR – Distributed AG vs Single AG Across Two Sites

Introduction

As a database administrator responsible for keeping your company’s mission-critical SQL Server databases up and running 24/7, designing a robust disaster recovery (DR) solution is one of your most important tasks. You never know when a catastrophic event like a natural disaster, power outage, or cyberattack could take your primary datacenter offline. Having a solid DR plan in place ensures you can quickly fail over to a secondary site and keep the business running with minimal downtime and data loss.

When it comes to SQL Server DR architectures, two common approaches are:

  1. Deploying a single availability group (AG) that spans multiple sites
  2. Configuring distributed availability groups (distributed AG)

But which one is right for your environment? In this article, we’ll compare these DR strategies, explain how they work, and discuss the benefits and trade-offs of each to help you make an informed decision.

Single Availability Group Across Multiple Sites

One DR design is stretching a single AG across a primary and secondary site, typically over a WAN link. The primary replica lives in your main production datacenter, while a synchronous-commit secondary replica is deployed to the DR site. Synchronous-commit ensures any transactions committed on the primary are hardened on the remote secondary before being acknowledged to the client, preventing data loss.

Here’s a simplified diagram:

graph TD
A[Primary Site] -- synchronous commit --> B[DR Site]

The main advantage of this approach is simplicity. With just a single AG, there are fewer moving parts to manage and monitor. Failover is also straightforward – if the primary site goes down, you manually fail the AG over to the secondary and redirect your application connections.

However, there are some downsides to consider:

  • WAN latency – Synchronous-commit AGs are very sensitive to network latency. The further apart your sites are geographically, the higher the latency, which can impact commit performance and application response times. Microsoft recommends round-trip latency be kept under 5ms for sync AGs.
  • No automatic failover – With the secondary replica in async-commit mode, automatic AG failover is not possible as some data loss may occur. You must manually fail over to the DR site, which takes time and is prone to human error.
  • Potential data loss – If an outage occurs that takes out both your primary AG replica and the DR site, some data loss is inevitable since the remote replica is slightly behind.

Distributed Availability Groups

Distributed AGs, introduced in SQL Server 2016, provide an alternative DR architecture that addresses some of those single AG limitations. With distributed AG, you deploy separate AGs in each site, typically with synchronous-commit replicas for HA and automatic failover within the site. You then link the AGs together via asynchronous replication.

Here’s what it looks like:

graph TD
A[AG1 - Primary Site] --async--> B[AG2 - DR Site]
A --> C[HA Replica]
B --> D[HA Replica] 

The key advantages of distributed AG are:

  • No impact to primary site performance – Since replication between the sites is asynchronous, there is no performance overhead or dependency on WAN latency.
  • Automatic failover within each site – With a sync-commit replica in each site’s AG, automatic failover can happen quickly if the primary replica fails, minimizing downtime.
  • Protects against data loss – By having a separate AG in the DR site, you have a consistent copy of data that is only slightly behind the primary. If both sites experience an outage, you can recover to that last hardened transaction.

However, distributed AG adds management complexity since you have multiple AGs to administer. It also requires a more manual process to fail over between sites – the database administrator must manually switch replication from AG1 to AG2 and redirect applications to the DR site.

Which One Is Right For You?

So which DR strategy should you choose for your SQL Server environment? As with most things in tech, the answer is “it depends”.

If your sites are geographically close with very low latency between them, and your RPO and RTO goals are aggressive, a single stretched AG might fit the bill. The simplicity and automatic failover are appealing.

However, in my experience, most organizations opt for distributed AG to remove that performance dependency on WAN latency and add an extra layer of protection against data loss. The additional administration overhead is generally an acceptable trade-off.

Whichever route you go, be sure to document the manual steps required for failover and DR recovery. Regularly practice those procedures to ensure everyone knows their role and can execute the failover quickly when called upon. With a well-designed and tested DR strategy based on SQL Server AGs, you’ll sleep much better at night knowing you can recover from even the worst disaster scenarios.

Microsoft link to learn more: Distributed availability groups

Related Posts

Troubleshooting Missing SQL Server Statistics

Learn how to diagnose and fix missing SQL Server statistics through a practical troubleshooting guide, including step-by-step solutions and best practices.

Read more

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Discover more from The DBA Hub

Subscribe now to keep reading and get access to the full archive.

Continue reading