SQL Server high availability and disaster recovery for AWS, Azure and GCP: A guide
The public cloud offers a myriad of options for providing high availability and disaster recovery protections for SQL Server database applications. Conversely, some of the options available in a private cloud are not available in the public cloud. Given the many choices and limitations, the challenge faced by system and database administrators is determining the best available options for each application running in hybrid and purely public clouds.
All cloud service providers (CSPs) have service level agreements (SLAs) with money-back guarantees for when uptime falls below specified levels, usually ranging from 95.00% to 99.99%. Four nines of uptime is generally accepted as constituting HA, and to be eligible for these 99.99% SLAs, the configurations need to meet certain requirements.
But be forewarned: The SLAs only guarantee “dial tone” at the server level, and explicitly excluded many causes of downtime at the database and application levels. These exclusions inevitably include natural disasters, the customer’s actions (or inactions), and the customer’s system or application software. There may also be a separate SLA for storage that is lower than the one for servers. So while it is advantageous to leverage various aspects of a CSP’s infrastructure, additional provisions are needed to ensure adequate uptime for mission-critical SQL Server databases.
Differences between HA and DR
Properly leveraging the cloud’s resilient infrastructure requires understanding key differences between “failures” and “disasters” because those differences affect the choice of provisions used for HA and DR protections. Failures are small in scale and short in duration, affecting a server, rack, or the power or cooling in a single datacenter. Disasters have more widespread and enduring impacts, and can affect multiple datacenters in ways that preclude rapid recovery.
The most consequential effect involves the location of the redundant resources (systems, software and data), which can be local—on a Local Area Network—for recovering from a localized failure. By contrast, the redundant resources required to recover from a widespread disaster must span a Wide Area Network.
For database applications that require high transactional throughput performance, the ability to replicate the active instance’s data synchronously across the LAN enables the standby instance to be “hot” and ready to take over immediately in the event of a failure. Such rapid recovery should be the goal of all HA provisions.
Data must be replicated asynchronously in DR configurations to prevent the latency inherent in the WAN from adversely impacting on the throughput performance in the active instance. This means that updates being made to the standby instance always lag behind updates being made to the active instance, making it “warm” and resulting in an unavoidable delay during the manual recovery process.
All three major CSPs accommodate these differences with redundancies both within and across datacenters. Of particular interest is the variously named “availability zone” that makes it possible to combine the synchronous replication available on a LAN with the geographical separation afforded by the WAN. These zones connect two or more regional datacenters via a low-latency, high-throughput network to facilitate synchronous data replication. With latencies around one millisecond, the use of multi-zone configurations has become a best practice for HA.
For DR, all CSPs have offerings that span multiple regions to afford additional protection against major disasters that could affect multiple zones. For example, Google has what could be called DIY (Do-It-Yourself) DR guided by templates, cookbooks and other tools. Microsoft and Amazon have managed DR-as-a-Service (DRaaS) offerings: Azure Site Recovery and CloudEndure Disaster Recovery, respectively.
For all three CSPs it is important to note that data replication across regions must be asynchronous, so the recovery will need to be performed manually to ensure minimal or no data loss. The resulting delay in recoveries is tolerable, however, because region-wide disasters are rare.
Making SQL Server “always on”
SQL Server offers two of its own HA/DR features: Always On Failover Cluster Instances and Always On Availability Groups. FCIs afford three notable advantages: inclusion in the less expensive Standard Edition; protection of the entire SQL Server instance; and support in all versions since SQL Server 7. A significant disadvantage is the need for a storage area network (SAN) or other form of shared storage, which is unavailable in the cloud. The lack of shared storage was addressed in Windows Server 2016 Datacenter Edition with the introduction of Storage Spaces Direct. But S2D also has limitations; most notably its inability to span availability zones.
SQL Server’s other HA/DR feature, Always On Availability Groups, is a more robust solution capable of providing rapid recoveries with no data loss. Among its other advantages are inclusion in SQL Server 2017 for Linux, no need for shared storage, and readable secondaries for queries (with appropriate licensing). But for Windows it requires licensing the substantially more expensive Enterprise Edition and it lacks protection for the entire SQL Server instance.
It is worth noting that SQL Server also offers a Basic Availability Groups feature, but it supports only a single database per Availability Group, making it suitable for only the smallest of environments.
The limitations associated with both options have created a need for third-party failover clustering solutions purpose-built to provide HA/DR protections for virtually all Windows and Linux applications in private, public and hybrid cloud environments. These software-only solutions facilitate, at a minimum, real-time data replication, continuous monitoring able to detect failures at the application level, and configurable policies for failover and failback. Most also offer a variety of value-added capabilities, including some specific to popular applications like SQL Server.
Failover clustering offerings afford two major advantages: SANless operation that overcomes the lack of shared storage in the cloud and application-agnosticism that eliminates the need to have different HA/DR provisions for different applications.
Editor’s note: More detailed information about the operation and benefits of SANless failover clustering is available in How to make Amazon Web Services highly available for SQL Server.
Interested in hearing industry leaders discuss subjects like this and sharing their experiences and use-cases? Attend the Cyber Security & Cloud Expo World Series with upcoming events in Silicon Valley, London and Amsterdam to learn more.