In the previous part we talked about Azure Arc, which is integral to the Azure hybrid cloud story. Azure Arc is essentially the glue that sticks your on-prem or multi-cloud infrastructure to the Azure portal for unified management and control. We are now ready to move on to other infrastructure-focused concepts such as storage, compute, and networking, and in the case of Azure Stack HCI, these are all software-defined.
First, let's talk about storage. Azure Stack HCI implements software-defined storage using a technology called "Storage Spaces Direct" (S2D). S2D takes the locally attached storage drives on each server and combines them all into one big storage pool with built-in features like caching, fault tolerance, erasure coding (parity), scalability, and more.
The documentation on S2D is pretty great, and I highly recommend checking it out. For the purpose of this blog, we will quickly touch on some of the major benefits and features S2D provides at a high level.
S2D automatically creates a built-in, persistent, fault-tolerant read and write cache. Which drives are used for cache versus capacity depends on the drive type. Currently, S2D works with four types of storage devices:
- PMem (Persistent Memory)
- NVMe (Non-Volatile Memory Express)
- SSD (Solid State Drive)
- HDD (Hard Disk Drive)
You can have an all-flash deployment (NVMe and/or SSD) or a hybrid deployment (a combination of flash drives and HDDs). The faster drives are allocated for cache and the slower ones for capacity. An important thing to remember here is that the drives assigned as cache do not contribute to raw storage capacity, so the total raw storage capacity of your deployment is the sum of your capacity drives only.
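To make the cache-versus-capacity split concrete, here is a toy Python sketch of the idea. This is my own simplified model, not S2D's actual selection algorithm: it assumes that when multiple drive types are present, the fastest type present becomes the cache tier, and that a single-type deployment has no cache tier. Real S2D behavior has more nuance, so treat this as illustrative only.

```python
# Toy model (NOT the real S2D algorithm): fastest drive type present
# becomes cache; everything else is capacity. A single-type deployment
# is assumed to have no cache tier.
SPEED_ORDER = ["PMem", "NVMe", "SSD", "HDD"]  # fastest → slowest

def partition_drives(drives):
    """drives: list of (drive_type, size_tb) tuples.
    Returns (cache_drives, capacity_drives)."""
    types_present = {t for t, _ in drives}
    if len(types_present) == 1:
        return [], list(drives)  # single type: everything is capacity
    fastest = min(types_present, key=SPEED_ORDER.index)
    cache = [d for d in drives if d[0] == fastest]
    capacity = [d for d in drives if d[0] != fastest]
    return cache, capacity

# Hybrid example: NVMe caches in front of HDD capacity.
cache, capacity = partition_drives(
    [("NVMe", 1.6), ("NVMe", 1.6), ("HDD", 4.0), ("HDD", 4.0)]
)
# Cache drives do not count toward raw capacity.
raw_capacity_tb = sum(size for _, size in capacity)
print(raw_capacity_tb)  # 8.0
```

Note how the two 1.6 TB NVMe drives vanish from the raw-capacity total; only the capacity tier is counted, exactly as described above.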
Fault Tolerance and Storage Efficiency
S2D implements resiliency by default. Depending on the number of servers participating in S2D, the resiliency strategy changes to either "Mirroring", "Parity (erasure coding)", or a combination of both called "Mirror-accelerated Parity".
Mirroring provides fault tolerance by keeping multiple copies of all data. In other words, all data is duplicated in its entirety across different physical drives, so losing one drive never loses the data.
If your cluster has only 2 servers, S2D implements two-way mirroring, where you have 2 copies of all the data, so the storage efficiency is 50%. This means that if you wanted 1 TB of usable storage, your S2D storage pool would need a raw capacity of 2 TB.
If your cluster has 3 servers, S2D implements three-way mirroring, where you have 3 copies of all the data. This drops the storage efficiency to 33.3% but gives you higher resiliency, since you now have 2 additional copies of the data instead of only one. So if you wanted 1 TB of usable storage, your S2D storage pool would need a raw capacity of 3 TB.
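The mirroring arithmetic above is simple enough to capture in a couple of helper functions. This is just the math from the two examples restated, nothing S2D-specific:

```python
def mirror_raw_needed(usable_tb, copies):
    """Raw pool capacity needed for `usable_tb` of usable space
    when every piece of data is stored `copies` times."""
    return usable_tb * copies

def mirror_efficiency(copies):
    """Fraction of raw capacity that ends up usable."""
    return 1 / copies

print(mirror_raw_needed(1, 2))        # two-way mirror: 2 TB raw for 1 TB usable
print(mirror_raw_needed(1, 3))        # three-way mirror: 3 TB raw
print(f"{mirror_efficiency(3):.1%}")  # 33.3%
```

The same pattern generalizes: every extra copy buys one more drive/server failure you can survive, at the cost of dividing your efficiency by the copy count.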
Parity (Erasure Coding)
As you go beyond a 3-server cluster, the resiliency method changes from mirroring to parity. Instead of copying the entire data multiple times, parity provides fault tolerance by using bitwise arithmetic. The underlying math is fairly complex and not really in scope for most of our jobs here, so we'll skip that part. All we need to understand is that parity encoding provides fault tolerance at a much better storage efficiency than mirroring, and that efficiency keeps improving as you add more servers to the pool.
There are two kinds of parity encoding, single parity and dual parity, of which the latter is preferred because it can sustain two simultaneous hardware failures instead of just one.
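To make "bitwise arithmetic" slightly less magical, here is a toy single-parity example using XOR. This is not S2D's actual encoding (dual parity uses more sophisticated math), but it shows the core idea: one extra parity block lets you rebuild any single lost data block instead of storing a full copy of everything.

```python
from functools import reduce

# Three data blocks, imagined as slabs living on three different drives.
data = [b"\x01\x02", b"\xf0\x0f", b"\xaa\x55"]

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parity = xor_blocks(data)  # stored on a fourth drive

# Simulate losing the second drive: XOR the survivors with the
# parity block to reconstruct the missing data.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == data[1])  # True
```

Note the efficiency win: protecting three blocks cost one parity block (75% efficiency) instead of three mirror copies, and the advantage grows as more data blocks share a parity block.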
Lastly, we have the combination of both: mirror-accelerated parity. It is available from 4 servers onwards and, as you'd expect, its resiliency and storage efficiency fall in between those of mirroring and parity alone.
You may be wondering: why use this at all if you get better storage efficiency with dual parity alone? The answer is that mirroring is recommended for performance-sensitive workloads and parity for the rest, so the ability to mix the two across parts of your storage capacity often gives better overall results.
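A back-of-the-envelope way to see the trade-off is to blend the two tiers' efficiencies. The numbers below (three-way mirror at 33.3%, dual parity at 50%) are illustrative assumptions for a small cluster, and the model is deliberately naive; real S2D also moves data between tiers over time:

```python
def blended_efficiency(usable_mirror, usable_parity, mirror_eff, parity_eff):
    """Overall storage efficiency when part of a volume's usable space
    is mirrored and the rest uses parity (toy model)."""
    raw = usable_mirror / mirror_eff + usable_parity / parity_eff
    return (usable_mirror + usable_parity) / raw

# 200 GB on three-way mirror (33.3%) + 800 GB on dual parity (assume 50%)
eff = blended_efficiency(200, 800, 1/3, 0.5)
print(f"{eff:.1%}")  # 45.5%
```

You give up a few points of efficiency versus pure dual parity, but the mirrored slice absorbs the performance-sensitive writes, which is the whole point of the mix.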
It is recommended that all servers participating in S2D have exactly the same drive configuration: the same number of drives, the same capacities, the same types, and even the same models.
Of course, this is not always practical in the real world, and you may not have the same hardware lying around whenever you need it. So in practice, if you can’t use the same configuration, the next best thing is to select the closest one.
Phew! That was quite a lot to digest at once, but hopefully it has given you a better understanding of S2D, which is crucial to your implementation of Azure Stack HCI. As I mentioned earlier, this is only scratching the surface; there's a lot more to it, and it is covered very well in the Microsoft documentation.
Next up, we will talk a bit about clusters themselves and the built-in DR strategy in Azure Stack HCI.
See you in the next one!