Testing Stretched Clusters

Stretched clustering is a feature that provides native disaster recovery and business continuity for Azure Stack HCI clusters. It allows you to stretch an Azure Stack HCI cluster between two sites, using Storage Replica replication and automatic VM failover to keep the sites' storage in sync.

We perform tests to see if any issues arise when creating stretched clusters for our customers. We also want to see how a stretched cluster behaves when it experiences an unexpected outage by testing the cluster’s resiliency.

In this blog, we will take you through three different stretched cluster tests that we ran on one of our DataON Kepler 2-node solutions for a customer. Kepler 2-node solutions connect the two nodes directly to each other, without needing a switch in between.

Test 1: Creating a stretched cluster in Windows Admin Center

We created a stretched cluster in Windows Admin Center using the cluster creation tool and selected the option for two sites.

In the networking step, it is alright for the network check to fail, because this system is a stretched cluster built from two 2-node, direct-connect sets. When Windows Admin Center uses one node to ping the other three nodes, it fails to reach two of the three because those two nodes are on a different subnet. The networking step also shows a warning stating that Windows Admin Center does not validate subnets and VLANs, but you can still configure them.

Unlike normal cluster creation, failing the network test will not prevent you from continuing to the validation test. Since the ping test failed, the networking portion of the validation test will also fail. However, you can still continue and create the stretched cluster.
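If you want to repeat the validation outside of Windows Admin Center, a minimal PowerShell sketch is below; the node names are placeholders, and on a direct-connect stretched cluster the Network category will flag the same cross-subnet ping failures described above.

# Run cluster validation from PowerShell (node names are placeholders).
Test-Cluster -Node "node-a1", "node-a2", "node-b1", "node-b2" -Include "Storage Spaces Direct", "Inventory", "Network", "System Configuration"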

The stretched cluster was created successfully, but the volume was not. This customer's order required a nested mirror-accelerated parity (MAP) volume. Due to the customer's time constraints, the volume portion was skipped and the other testing was performed. We continued with volume creation on a different stretched cluster order.
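For reference, a nested mirror-accelerated parity volume is normally created from PowerShell by defining the two nested tiers and then building the volume from them. The sketch below is generic and not the exact commands for this order; the tier names, media type, and sizes are placeholders.

# Define the nested tiers (names, media type, and sizes are placeholders).
New-StorageTier -StoragePoolFriendlyName "S2D*" -FriendlyName "NestedMirror" -ResiliencySettingName Mirror -MediaType SSD -NumberOfDataCopies 4
New-StorageTier -StoragePoolFriendlyName "S2D*" -FriendlyName "NestedParity" -ResiliencySettingName Parity -MediaType SSD -NumberOfDataCopies 2 -PhysicalDiskRedundancy 1 -NumberOfGroups 1 -FaultDomainAwareness StorageScaleUnit -ColumnIsolation PhysicalDisk

# Create the nested mirror-accelerated parity volume from those tiers.
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "Volume01" -StorageTierFriendlyNames "NestedMirror", "NestedParity" -StorageTierSizes 200GB, 800GB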

Test 2: Creating a stretched cluster with an existing cluster

When creating a stretched cluster from an existing cluster, you must first create the sites (cluster fault domains) for the two locations. Since we already had one cluster up, its site assignment did not change.

1/ The PowerShell commands below create the two sites needed for the stretched cluster. This first command will error out because the site already exists.

New-ClusterFaultDomain -CimSession "azsrit3-cluster" -FaultDomainType Site -Name "Anaheim"

2/ This will not error out since the site has not been set up.

New-ClusterFaultDomain -CimSession "azsrit3-cluster" -FaultDomainType Site -Name "rancho"

Next, assign the server nodes to their sites. The existing cluster's nodes will already be assigned to their cluster fault domain; the newly added servers will need to be added to the site that was just created.

3/ This command makes no change because the existing cluster's nodes were already assigned to this site.

Set-ClusterFaultDomain -CimSession "azsrit3-cluster" -Name "azsrit3-n1", "azsrit3-n2", "azsrit3-n3" -Parent "Anaheim"

4/ This command assigns the three nodes being added for the stretched cluster to the other site.

Set-ClusterFaultDomain -CimSession "azsrit3-cluster" -Name "azsrit3-n4", "azsrit3-n5", "azsrit3-n6" -Parent "rancho"

5/ Verify that the nodes are assigned to the correct sites.

Get-ClusterFaultDomain
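If you run the check from a management machine rather than on a cluster node, the same -CimSession parameter used above applies, and you can narrow the output to just the site fault domains:

# Check only the site fault domains from a remote management machine.
Get-ClusterFaultDomain -CimSession "azsrit3-cluster" | Where-Object Type -eq "Site"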

Once the sites are established, enable Storage Spaces Direct (S2D) on the new site and join the nodes to the cluster. We made the mistake of not enabling S2D on the secondary site (Rancho) before joining it to the cluster, but S2D automatically enabled itself on the secondary site and created the storage pool after the nodes were joined. It took about 10 to 15 minutes for the new pool to show up in Failover Cluster Manager and Windows Admin Center.
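As a minimal sketch of the join step, assuming the secondary-site node names from the earlier commands (the -NoStorage switch keeps the disks from being auto-added before S2D claims them):

# Join the secondary-site nodes to the existing cluster (node names reused from above).
Add-ClusterNode -Cluster "azsrit3-cluster" -Name "azsrit3-n4", "azsrit3-n5", "azsrit3-n6" -NoStorage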

6/ Enable cluster S2D (used for the test).

Enable-ClusterStorageSpacesDirect

7/ Enable cluster S2D if you want to name the storage pool.

Enable-ClusterStorageSpacesDirect -PoolFriendlyName "$ClusterName Storage Pool"

The stretched cluster was created but was not configured correctly. The virtual disk on the "S2D on azsrit3-cluster" storage pool was not replicating to the pool in Rancho. Windows Admin Center did see the stretched cluster, the two pools, and the sites the nodes live in. Since the virtual disk had no VMs on it for testing the stretched cluster, and replication had not been enabled on it, it was easier to destroy the virtual disk and recreate it with replication. We recreated the virtual disk with replication and created 10 VMs on the Anaheim site for testing.
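For the test VMs, a simple loop is enough. This is an illustrative sketch only; the VM names, volume path, and sizes are placeholders, not the exact VMs we deployed.

# Create 10 small test VMs on the replicated volume and register them as clustered roles.
1..10 | ForEach-Object {
    $name = "TestVM$_"
    New-VM -Name $name -MemoryStartupBytes 2GB -Generation 2 -Path "C:\ClusterStorage\Volume01" -NewVHDPath "C:\ClusterStorage\Volume01\$name\$name.vhdx" -NewVHDSizeBytes 40GB
    Add-ClusterVirtualMachineRole -VMName $name -Cluster "azsrit3-cluster"
}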

Issues encountered:

  1. When the stretched cluster is created from an existing cluster, the existing virtual disk does not replicate. To make the existing virtual disk replicate to the secondary site, you need to create a new virtual disk of the same size, test the SR topology, and create the SR partnership (see the sketch after this list). We got held up on creating the SR partnership, so due to time constraints we stopped, deleted the virtual disk, and recreated it in Windows Admin Center with replication.
  2. When creating virtual disks with replication in Windows Admin Center, we were only able to create one virtual disk. Windows Admin Center said we did not have enough storage to create any more, even though we had more than enough storage for two more virtual disks, and Failover Cluster Manager showed that the storage pool had enough space. This could be a bug in Windows Admin Center or an issue with the stretched cluster setup. We will have to create a new stretched cluster without starting from an existing cluster to verify whether this is a Windows Admin Center bug.
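For reference, the manual replication steps mentioned in the first issue look roughly like this. The computer names, replication group names, volumes, and log volumes below are placeholders; each site needs a data volume of the same size and a log volume already in place.

# Validate the replication topology first (all names, volumes, and paths are placeholders).
Test-SRTopology -SourceComputerName "azsrit3-n1" -SourceVolumeName "C:\ClusterStorage\Volume01" -SourceLogVolumeName "L:" -DestinationComputerName "azsrit3-n4" -DestinationVolumeName "C:\ClusterStorage\Volume02" -DestinationLogVolumeName "L:" -DurationInMinutes 5 -ResultPath "C:\Temp"

# Then create the partnership between the two sites.
New-SRPartnership -SourceComputerName "azsrit3-n1" -SourceRGName "rg-anaheim" -SourceVolumeName "C:\ClusterStorage\Volume01" -SourceLogVolumeName "L:" -DestinationComputerName "azsrit3-n4" -DestinationRGName "rg-rancho" -DestinationVolumeName "C:\ClusterStorage\Volume02" -DestinationLogVolumeName "L:"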

Test 3: Testing stretched cluster resiliency

First, set up a stretched cluster with three nodes per site, with one site's nodes direct-connected to each other and the other site's nodes connected through a switch.

The first test was to see what happens if the administrator has to shut down one site for any reason. The cluster had 10 VMs running on one node at the time of the test. All nodes at that site were set to shut down. During the shutdown, Failover Cluster Manager and Windows Admin Center showed the nodes being drained of their roles. It took about 15 to 20 minutes for the 10 VMs to drain off the site. Once the site was fully drained and shut down, the volume on the other site was attached and all roles were running. Some VMs ended up on different nodes at the secondary site.
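The same planned drain can be done from PowerShell. This is a sketch only; the cluster and node names are placeholders.

# Drain the roles off each node at the site being taken down, then shut the node off.
foreach ($node in "site1-n1", "site1-n2", "site1-n3") {
    Suspend-ClusterNode -Name $node -Cluster "stretch-cluster" -Drain -Wait
    Stop-Computer -ComputerName $node -Force
}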

The second test was to see what happens when the cluster has an unexpected shutdown. We used the same setup as before, but this time the nodes were forcefully shut down through the BMC. The process was shorter, and we saw the VMs running on the secondary site within about 5 to 10 minutes. The volume was attached and the VMs were back in a running state. Some of the VMs were on different nodes, but that was expected. Make sure you set up a file share witness before running this test.
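The file share witness mentioned above can be configured ahead of time with Set-ClusterQuorum; the cluster name and share path below are placeholders.

# Configure a file share witness before running the failover tests (names are placeholders).
Set-ClusterQuorum -Cluster "stretch-cluster" -FileShareWitness "\\witness-server\witness-share"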