1. High availability and disaster recovery

Every business seeks data solutions that can address their operational requirements. These requirements translate to specific values of the Recovery Time Objective (RTO), and Recovery Point Objective (RPO).
• The RTO indicates how long the business can endure database and application outages.
• The RPO determines how much data loss is tolerable.
The combination of RTO and RPO requirements dictates what solution should be adopted.
Typically, applications with RPO and RTO values close to zero should use a high availability (HA) solution, whereas disaster recovery (DR) solutions can be used for those with higher values. In many cases, HA and DR solutions can be mixed to address more complex requirements.
>>>>>> Table 1: HA/DR options in Microsoft SQL Server <<<<<<
• Always On basic availability groups in SQL Server 2019 Standard edition support a single passive replicas (in addition to the primary replica) for a single database per availability group. For multiple databases in HA mode, a separate availability group needs to be defined for each database.
• MSSQL Failover Cluster Instance is often used as a pure HA solution.
These solutions rely on one or more secondary servers, with SQL Server running as active or passive standby. Based on the specific HA/DR requirements, these servers can be located in close proximity to each other or far apart.

1.1 Availability Zones and multi-AZ deployment

• AWS Availability Zone are a good solution for synchronous replication of your databases using Mirroring, Always On Availability Groups, Basic Availability Groups, or Failover Cluster Instances.
• SQL Server provides zero data loss and, when combined with the low-latency infrastructure of AZs, provides high performance. This is one of the main differences between most on-premises deployments and AWS. For example, Always On Failover Cluster Instance (FCI) is often used inside a single data center, because all nodes in an FCI cluster must have access to the same shared storage. Locating these nodes in different data centers could degrade performance.
• In AWS, FCI nodes can be located in separate AZs and still provide high performance because of the low-latency network link between all AZs within a Region.
• SQL Server FCI relies on shared storage being accessible from all nodes participating in FCI.
• Amazon FSx for Windows File Server is a fully managed service providing shared storage that automatically replicates the data synchronously across two AZs, provides high availability with automatic failure detection, failover, and failback, and fully supports the Server Message Block (SMB) Continuous Availability (CA) feature.
Figure 1: Using Amazon FSx as file-share for Failover Cluster Instance, or as file-share witness in Windows Server Failover Cluster

2. Using AWS Launch Wizard to deploy Microsoft SQL Server on Amazon EC2 instances

AWS Launch Wizard is a service that offers a guided way of sizing, configuring, and deploying AWS resources for third party applications, such as Microsoft SQL Server.
You can use this service to deploy MS SQL Server with following configurations:

SQL Server single instance on Windows
SQL Server single instance on Linux
SQL Server HA using Always On Availability Groups on Windows
SQL Server HA using Always On Availability Groups on Linux
SQL Server HA using Always On Failover Cluster Instance on Windows AWS Launch Wizard:

a. Provides an estimated cost of deployment, and enables you to modify your resources to instantly view an updated cost assessment.
b. Handles all the heavy-lifting, including installation and configuration of Always On Availability Groups or Failover Cluster Instance.
c. Creates CloudFormation templates that can serve as a baseline to accelerate subsequent deployments. For post-deployment management, AWS Systems Manager (SSM) Application Manager automatically imports application resources created by AWS Launch Wizard. From the Application Manager console, you can view operations and perform operations tasks.
>>> Figure 2: AWS Launch Wizard deploys MS SQL FCI using Amazon FSx for Windows File Server <<<

2.1 Multi-Region deployments

You can leverage the global scale of AWS to ensure availability under almost any circumstances. By default, Amazon VPC is confined within a single Region. Therefore, for a multi-region deployment, you need to establish connectivity between your SQL Server instances that are deployed in different Regions. In AWS, there are a number of ways to do this:

VPC peering
AWS Transit Gateway
VPN connections
VPC sharing

• If applications or users were deployed in remote Regions which need to connect to your SQL Server instances, you can use the AWS Direct Connect.
• It is possible to have synchronous replication in a multi-region SQL Server deployment, the farther apart your selected Regions are, the more severe the performance penalty is for a synchronous replication.
• The best practice for multi-region deployments is to establish an asynchronous replication, especially for Regions that are geographically distant.
• For those workloads that come with aggressive RPO requirements, asynchronous multi-Region deployment can be combined with a Multi-AZ or Single-AZ synchronous replication.
• You can also combine all three methods into a single solution.
• In cases involving several replicas across two or more Regions, distributed availability groups might be the most suitable option.
• Distributed availability groups can also be used to increase the number of read replicas.
• A traditional availability group allows up to 8 read replicas. This means you can have a total of 9 replicas, including the primary. Using a distributed availability group, a second availability group can be added to the first, increasing the total number of replicas to 18. This process can be repeated with a third availability group and a second distributed availability group.
• The second distributed availability group can be configured to include either the first or second availability groups as its primary.
• Distributed availability group is the means through which SQL Server Always On can achieve virtually unlimited scale.
• Another use of a distributed availability group is zero downtime database migrations. The independence of SQL Server Distributed Availability Group from Active Directory and Windows Server Failover Cluster (WSFC) is the main benefactor for these cases. It enables you to keep both sides of the migration synchronized without having to worry about the complexities of Active Directory or WSFC.
>>>>> Figure 2: SQL Server distributed availability group in AWS <<<<<

2.2 Disaster recovery

• DR solutions require a replica of SQL Server databases in another server. The other server is often in a remote site far away from the primary site. This means higher latency, and therefore, lower performance if you rely on HA solutions that use synchronous replication.
• DR solutions often rely on asynchronous replication of data.
• DR solutions are based on either block-level or database-level replication.
• AWS also provides CloudEndure Disaster Recovery. You can use CloudEndure DR to reduce downtime to a few minutes, protect against data loss for sub-second RPO, simplify implementation, increase reliability, and decrease the total cost of ownership.
• CloudEndure is an agent-based solution that replicates entire virtual machines, including the operating system, all installed applications, and all databases, into a staging area.
• The staging area contains low-cost resources automatically provisioned and managed by CloudEndure Disaster Recovery. Because the staging area does not run a live version of your workloads, you don’t need to pay for duplicate software licenses or high-performance compute. Rather, you pay for low-cost compute and storage.
• The fully provisioned recovery environment, with the right-sized compute and higher-performance storage required for recovered workloads, is launched only during a disaster or drill. AWS also makes CloudEndure available at no additional cost for migration projects.
>>>>> Figure 3: CloudEndure disaster recovery <<<<<

3. Performance optimization

3.1 Using Amazon Elastic Block Store (Amazon EBS)

• Amazon EBS is a Single-AZ block storage service with a number of flexible options to cater to diverse requirements.
• For maximizing performance on a single volume, use a Provisioned IOPS Solid State Drive (SSD).
• You can provision up to 64,000 IOPS per io2 EBS volume, along with 1000-MiB/s throughput.
• For more demanding workloads, the io2 Block Express EBS volumes guarantee 256,000 IOPS and 4,000 MiB/s throughput per volume.
• For more IOPS and throughput than provided by a single EBS volume, you can create multiple volumes and stripe them in your Windows or Linux instance.
• Remember to use EBS-optimized EC2 instance types, this means a dedicated network connection is allocated to serve requests between your EC2 instance and the EBS volumes attached to it.
• gp2 and gp3 volumes offer a better balance of price and performance for SQL Server workloads. gp2 volumes deliver single-digit ms latencies and the ability to burst to 16,000 IOPS for extended periods. The IOPS load generated by SQL Server database tends to spike frequently.
• One of the major benefits of using EBS volumes is the ability to create point-in-time and instantaneous EBS snapshots. This feature copies the EBS snapshot to Amazon S3 infrastructure, which provides 99.999999999% durability.
• Despite EBS volumes being confined to a single AZ, EBS snapshots can be restored to any AZ within the same Region.
• The block-level snapshots are not the same as database backups, and not all features of database backups are attainable this way. Therefore, this method is often combined and complemented with a regular database backup plan.
• Although each EBS volume can be as large as 64 TB, and therefore, could take a long time to transfer all its data to Amazon S3, EBS snapshots are always point-in-time.
• After restore a volume from a snapshot, the volume is immediately available to applications for read and write but it takes some time to get its full performance capacity. Using Amazon EBS fast snapshot restore, helps in eliminate the latency of I/O operations on a block when it is accessed for the first time.
• You can use AWS Systems Manager Run Command to take application-consistent EBS snapshots of your online SQL Server files at any time, with no need to offline your database or put it in read-only mode. The snapshot process uses Windows Volume Shadow Copy Service (VSS) to take image-level backups of VSS-aware applications. MS SQL Server is VSS-aware and perfectly compatible with this technique. It is also possible to take VSS snapshots of Linux instances.
• You can also take crash-consistent EBS snapshots across multiple EBS volumes, attached to a Windows or Linux EC2 instance, without using orchestrator applications. Using this method, only lose uncommitted transactions and writes that are not flushed to the disk.
• SQL Server is capable of restoring databases to a consistent point before the crash time, this feature is also supported through AWS Backup.

3.2 Instance storage

• Storage-optimized EC2 instance types choosing storage-optimized EC2 instance types for maximum performance, it is essential to understand that some of the smaller instance types provide instance storage that is shared with other instances.
• These are virtual disks that reside on a physical disk attached to the physical host.
• By selecting instance type, such as i3.2xlarge, you ensure that there is a 1:1 correspondence between your instance store disk and the underlying physical disk.
• Instance disks are ephemeral and live only as long as their associated EC2 instance.
• If the EC2 instance fails, or stopped or ended, all of its instance storage disks are wiped out and the data stored on them is irrecoverable. Unlike EBS volumes, instance storage disks cannot be backed up using a snapshot. Therefore, if you choose to use EC2 instance storage for your permanent data, you need to provide a way to increase its durability.
• One suitable use for instance storage may be the tempdb system database files, those files are recreated each time the SQL Server service is restarted.
a. As a best practice, the tempdb files should be stored on a fast volume, separate from user databases.
b. For the best performance, ensure that the tempdb data files within the same filegroup are the same size and stored on striped volumes. Another use for EC2 instance storage is the buffer pool extension. This feature uses fast random-access disks (SSD) as a secondary cache between RAM and persistent disk storage, striking a balance between cost and performance when running workloads on SQL Server.
• Although instance storage disks are the fastest available to EC2 instances, their performance is capped at the speed of the physical disk.
• You can go beyond the single disk maximum by striping across several disks. You could also use instance storage disks as the cache layer in Storage Spaces (for single Windows instances) and Storage Spaces Direct (for Windows Server failover clusters) storage pools.

3.3 Amazon FSx for Windows File Server

Amazon FSx for Windows File Server is another storage option for SQL Server on Amazon EC2. This option is suitable for three major use-cases:

As shared storage used by SQL Server nodes participating in a Failover Cluster Instance.
As file-share witness to be used with any SQL Server cluster on top of Windows Server Failover Cluster.
As an option to attain higher throughput levels than available in dedicated EBS optimization To better understand the third case:

a. Notice that EBS throughput depends on EC2 instance size.
b. Smaller EC2 instance sizes provide lower EBS throughput; therefore, to attain EBS higher throughput, you need bigger instance sizes.
c. If a workload leaves a big portion of its network bandwidth unused, but requires higher throughput to access underlying storage, using a shared file system over SMB may unlock its required performance, while reducing cost by using smaller EC2 instance sizes.

Amazon FSx provides fast performance with baseline throughput up to 2 GB/second per file system, hundreds of thousands of IOPS, and consistent sub-millisecond latencies.
To provide the right performance for your SQL instances, you can choose a throughput level that is independent of your file system size. Higher levels of throughput capacity also come with higher levels of IOPS that the file server can serve to the SQL Server instances accessing it.
The storage capacity determines not only how much data you can store, but also how many IOPS you can perform on the storage – each GB of storage provides three IOPS. You can provision each file system to be up to 64 TB in size.

3.4 Bandwidth and latency

It is important to remember the difference between latency and bandwidth. You should find a balance between network latency and availability. To gain the highest bandwidth on AWS, you can leverage enhanced networking and Elastic Network Adapter (ENA), or the new Elastic Fabric Adapter (EFA) which, when combined with new generation of EC2 instances such as C6gn, C5n, R5n, I3en, or G4dn instances, can provide up to 100Gbps bandwidth. But this quite high bandwidth has no effect on latency. Network latency changes in direct correlation with the distance between interconnecting nodes. Clustering nodes is a way to increase availability, but placing cluster nodes too close to each other increases the probability of simultaneous failure, reducing availability. Putting them too far apart yields the highest availability, but at the expense of higher latency. AWS AZs within each AWS Region are engineered to provide a balance that fits most practical cases. Each AZ is engineered to be physically separated from other AZs, while keeping in close geographic proximity to provide low network latency. Therefore, in the overwhelming number of cases, the best practice is to spread cluster nodes across multiple AZ.

3.5 Read replicas

You might determine that many of your DB transactions are read-only queries, and that the sheer number of incoming connections is flooding your database. Read replicas are a known solution for this situation. You can offload your read-only transactions from your primary SQL Server instance to one or more read replica instances. Read replicas can also be used to perform backup operations, relieving primary instance from performance hits during backup windows. When using availability group listeners, if you mark your connection strings as read-only, SQL Server routes incoming connections to any available read replicas and only sends read/write transactions to the primary instance.
Always On Availability Groups supports up to four secondary replicas. There might be cases where you have users or applications connecting to your databases from geographically dispersed locations. If latency is a concern, you can locate read replicas close to your users and applications.

4. Security optimization

AWS security features can be combined with the built-in security features of MS SQL Server.

4.1 Amazon VPC

A best practice is to deploy your SQL Server instances in private subnets inside a VPC, and only allow access to the internet through a VPC NAT gateway, or a custom NAT instance.

4.2 Encryption at rest

If you are using EBS volumes to store your SQL Server database files, you have the option to enable block-level encryption. Amazon EBS transparently handles the encryption and decryption. This is available through a simple check box. Amazon FSx for Windows File Server also includes built-in encryption at rest. Both EBS and Amazon FSx are integrated with AWS KMS for managing encryption keys. So, through AWS KMS, you can either use keys provided by AWS, or bring your own keys. At the database level, you can use SQL Server Transparent Data Encryption (TDE), a feature available in MS SQL Server that provides transparent encryption of your data at rest. TDE is available on Amazon RDS for SQL Server, and you can also enable it on your SQL Server workloads on EC2 instances. Previously, TDE was only available on SQL Server Enterprise Edition. However, SQL Server 2019 has also made it available on Standard Edition. If you want to have encryption-at-rest for your database files on Standard Edition on an earlier version of SQL Server, you can use EBS encryption instead. It’s important to understand the tradeoffs and differences between EBS encryption and TDE. EBS encryption is done at the block level, that is, data is encrypted when it is stored and decrypted when it is retrieved. However, with TDE, the encryption is done at the file level. Database files are encrypted, and can only be decrypted using the corresponding certificate.

4.3 Encryption in transit

You can enable encryption in transit for your SQL Server workloads using the SSL/TLS protocol. MS SQL Server supports encrypted connections, and SQL Server workloads in AWS are no exception. When using SMB protocol for SQL Server storage layer, Amazon FSx automatically encrypts all data in transit using SMB encryption as you access your file system, without the need for you to modify SQL Server or other applications’ configurations.

4.4 Encryption in use

MS SQL Server offers Always Encrypted to protect sensitive data using client certificates. This provides a separation between those who own the data and can view it, and those who manage the data but should have no access. This feature is also available on both Amazon RDS for SQL Server, as well as SQL Server workloads on Amazon EC2.

4.5 AWS Key Management Service (AWS KMS)

AWS KMS is a fully managed service to create and store encryption keys. You can use KMS-generated keys or bring your own keys. In either case, keys never leave AWS KMS and are protected. You can use KMS keys to encrypt your SQL Server backup files when you store them on Amazon S3, Amazon S3 Glacier, or any other storage service. Amazon EBS encryption also integrates with AWS KMS.

4.6 Security patches

In AWS, you can use AWS Systems Manager Patch Manager to automate this process. Note that use cases for Patch Manager are not restricted to security patches.

5. Cost optimization

SQL Server can be hosted on AWS through License Included (LI) and Bring Your Own License (BYOL) licensing models. The advantage of LI model is that you do not need to have any long-term commitments and can stop using the product at any time and stop paying for its usage.
However, many businesses already have considerable investments in SQL Server licenses and might want to reuse their existing licenses on AWS. This is possible using BYOL:

If you have Software Assurance (SA), one of its benefits is the Microsoft License Mobility through Software Assurance program. This program enables you to use your licenses on server instances running anywhere, including on Amazon EC2 instances.
If you don’t have SA, you may still be able to use your own licenses on AWS using Amazon EC2 Dedicated Hosts. The BYOL option on EC2 Dedicated Hosts can significantly reduce costs, as the number of physical cores on an EC2 host is about half of the total number of vCPU available on that host. However, it’s difficult to track the license usage and compliance.

5.1 Using SQL Server Developer Edition for non-production

One of the easiest ways to save licensing costs is to use MS SQL Developer Edition for environments that are not going to be used by application end-users. For this, you can download SQL Server Developer Edition installation media and install it on your EC2 instances. SQL Server Developer Edition is equivalent to SQL Server Enterprise Edition, with full features and functionality.

5.2 Amazon EC2 CPU optimization

The z1d instance types provide the maximum CPU power, enabling you to reduce the number of CPU cores for compute-intensive SQL Server deployments. The SQL Server deployments might not be compute-intensive and require an EC2 instance type that provides intensity on other resources, such as memory or storage. Because EC2 instance types that provide these resources are also providing a fixed number of cores that might be more than your requirement, AWS offers to use EC2 CPU optimization to reduce the number of cores available to an EC2 instance.

5.3 Switch to SQL Server Standard Edition

• Enterprise-grade features of SQL Server are exclusively available in the Enterprise edition.
• Many of these features have also been available in the Standard edition, enabling you to switch to the Standard edition if you’ve been using Enterprise edition only for those features.
• One of the most common reasons for using Enterprise edition has always been its mission-critical HA capabilities.
• Now, there are alternative options that enable switching to Standard edition without degrading availability. One option is using Always On Basic Availability Groups. This option is similar to Always On Availability Groups, but comes with a number of limitations. The most important limitation is that you can have only one database in a basic availability group. The other option is using Always On Failover Cluster Instance (FCI).
• Since FCI provides HA at the instance level, it doesn’t matter how many databases are hosted on your SQL Server instance. Traditionally this option was restricted to HA within a single DC.
To simplify the complexity and cost of running MS SQL FCI deployments using Amazon FSx in the following scenarios:

Due to the complexity and cost of implementing a shared storage solution for FCI, you might have opted to use availability groups and SQL Server Enterprise Edition.
You might already use SQL Server FCI with shared storage using a third-party storage replication software solution. That implies that you purchased a license for the storage replication solution, and then deployed, administered, and maintained the shared storage solution yourself. You can now switch to using a fully managed shared storage solution with Amazon FSx, simplifying and reducing costs for your SQL Server FCI deployment.
You ran your SQL Server Always On deployment on-premises, using a combination of FCI and AG-FCI to provide HA within your primary data center site, and AG provided a DR solution across sites.

5.4 Z1d and R5b EC2 instance types

This instance type is optimized for workloads with high licensing costs, such as MS SQL Server and Oracle databases. For example:
>>> Figure 4: TCO comparison between SQL Server on r4.4xlarge and z1d.3xlarge <<<

5.5 Eliminating active replica licenses

Another opportunity for cost optimization is through applying a combination of BYOL and LI models. Active replicas are used primarily for:

Reporting
Backup
OLAP Batch jobs
HA The first three options are often performed intermittently. This means you would not need an instance continuously up and dedicated to running those operations. In a traditional on-premises environment, you would have to create an active replica that is continuously synchronized with the primary instance. This means you need to obtain an additional license for the active replica. >>> Figure 5: SQL Server active replication on-premises <<< In AWS, you can optimize this architecture by replacing the active replica with a passive replica, therefore relegating its role solely to the purpose of HA. Other operations can be performed on a separate instance using License Included, which could run for a few hours and then be shut down or ended. The data can be restored through an EBS snapshot of the primary instance. >>> Figure 6: Eliminating active replica licenses in AWS <<<

This solution is applicable when jobs on the active replica run with a low frequency. If you need a replica for jobs that run continuously or at a high frequency, consider using AWS Database Migration Service to continuously replicate data from your primary instance into a secondary. The primary benefit of this method is because you can do it using SQL Server Standard edition.

6. SQL Server on Linux

Deploying SQL Server on Linux is a way to eliminate Windows license costs. Installation and configuration of MS SQL on Linux can be non-trivial.

7. Operational excellence

It must considering the post-deployment operating and maintaining of the MSSQL Server workloads in AWS. The best practice is to assume that failures and incidents happen all the time. It’s important to be prepared and equipped to respond to these incidents. This objective is composed of three subobjectives:

Observe and detect anomaly
Detect the root cause
Act to resolve the problem

7.1 Observability and root cause analysis

Amazon CloudWatch is a service that enables real-time monitoring of AWS resources
and other applications. Amazon CloudWatch Application Insights for .NET and SQL Server is a feature of Amazon CloudWatch that is designed to enable operational excellence for Microsoft SQL Server and .NET applications. It continuously monitors the metrics and logs to detect anomalies and errors, while using AI / ML to correlate detected errors and anomalies.

7.2 Reducing mean time to resolution (MTTR)

The automated dashboards generated by Amazon CloudWatch Application Insights help you to take swift remedial actions to keep your applications healthy and to prevent impact to the end users of your application. It also creates OpsItems so you can resolve problems using AWS Systems Manager OpsCenter.

7.3 Patch management

AWS Systems Manager Patch Manager is a comprehensive patch management solution, fully integrated with native Windows APIs, and supporting Windows Server and Linux operating systems, as well as Microsoft applications, including Microsoft SQL Server

Blog

Best Practices for Deploying Microsoft SQL Server on Amazon EC2 | AWS White Paper Summary

Haytham Mostafa