Gartner HCI Report, WSSD Solutions… Microsoft is ready for prime time!

Ben Milbourne

Three weeks ago, Gartner published its highly anticipated 2018 Magic Quadrant for Hyperconverged Infrastructure. It’s the first time that Gartner has included software-only, “bring-your-own-hardware” hyper-converged systems, and Microsoft debuted at the forefront of the Visionaries quadrant. The report goes on to state:

“In its 2018 release, Microsoft will strengthen its data deduplication capabilities with Resilient File System (ReFS) and Mirror-accelerated parity volumes. This is tied to simplified HCI management with System Center and improved resilience, hyperscale and storage replicas, and an expanded ecosystem, including Linux container support.

Microsoft should be considered for all general-purpose infrastructure tied to Windows Server and Azure, in particular for business-critical, cloud and ROBO use cases. In percentage terms, few Microsoft customers have deployed its HCI. Feedback is very nascent and dependent on the 2018 release. Our research indicates thousands of customers are engaged in its HCI initiative, so momentum is building. Over the next year, Gartner expects Microsoft to increase its Windows Server and HCI migration with a broader range of integrated applications; an increased focus on cloud, mission-critical and ROBO opportunities; and additional Azure support and integration.

STRENGTHS

  • Microsoft’s HCI success is built on the sheer size of the Windows Server installed base, where even a small addressable market adoption for Storage Spaces Direct represents significant success in the HCI on-premises market.
  • With Azure momentum, Microsoft can target HCI for on-premises and off-premises, spanning both. It can also transition clients from IaaS/PaaS/SaaS to unified management with Microsoft’s System Center.
  • With HCI, Azure and Office 365, Microsoft has shifted from asset/license procurement to software rental models, allowing HCI/Storage Spaces to compete with OSS alternatives, tying support to Microsoft’s Software Assurance licensing program”
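
For readers curious what the mirror-accelerated parity capability Gartner mentions looks like in practice, it comes down to a single volume-creation command once Storage Spaces Direct is enabled. The following is only an illustrative sketch based on the pattern in Microsoft’s documentation; the volume name and tier sizes are placeholders, and the Performance/Capacity tier names assume the default tiers created on a hybrid (flash plus HDD) system.

    # Create a ReFS volume that lands writes on the mirror tier and rotates
    # colder data to the parity tier (names and sizes are illustrative)
    New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "BulkData" -FileSystem CSVFS_ReFS -StorageTierFriendlyNames Performance, Capacity -StorageTierSizes 200GB, 800GB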

Microsoft is really starting to shake up the hyper-converged market with its “do-it-yourself” solution, Storage Spaces Direct, and because it’s built into Windows Server 2016, many organizations are evaluating Storage Spaces Direct HCI as part of their upcoming hardware refreshes and data center buildouts.
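
If you want to kick the tires, the software side of a proof-of-concept is only a handful of PowerShell commands; the hardware choices are the hard part, which is exactly where the rest of this post is headed. Here is a minimal sketch, assuming four placeholder nodes (Node01 through Node04) that already have the Hyper-V and Failover Clustering features installed:

    # Validate the candidate nodes, including the Storage Spaces Direct test category
    Test-Cluster -Node Node01, Node02, Node03, Node04 -Include "Storage Spaces Direct", "Inventory", "Network", "System Configuration"

    # Create the cluster without automatically claiming any storage
    New-Cluster -Name S2DCluster -Node Node01, Node02, Node03, Node04 -NoStorage

    # Enable Storage Spaces Direct, which claims the local drives on every node and builds the pool
    Enable-ClusterStorageSpacesDirect -CimSession S2DCluster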

So this raises the question… with all the options out there, how do you choose the right Storage Spaces Direct solution for your business? Anyone could essentially follow Microsoft’s documentation, pick and choose parts from www.WindowsServerCatalog.com, and build a DIY Storage Spaces Direct HCI (which is great for demonstration or proof-of-concept purposes), but Carmen Crincoli, a Senior Program Manager for Windows Server at Microsoft, gives us an insightful look into why Microsoft specifically recommends Windows Server Software-Defined (WSSD) Solutions, especially for mission-critical applications and workloads:

The technical value of WSSD validated HCI solutions, Part 1

February 21, 2018 by Carmen Crincoli

In the previous blog post I discussed the high-level ideas behind our solution validation program, and the technical merits it accrues for people who buy and use those solutions. Building on those concepts, I’m going to dive into one particularly thorny integration challenge partners face when creating these solutions. I’ve been working with Windows and hardware at Microsoft for over 20 years. I know many of you have a similar experience in the industry. You’re probably pretty certain you know how to make these systems sing together. I’m here to tell you it’s not as straightforward as your past experiences might lead you to believe.

Standalone vs distributed systems

The way most servers and storage have been designed and validated in the PC ecosystem (until very recently) has been as standalone systems. You buy a server, you get the parts and sizes you need to support your workload, connect it to external networks and storage, and off you go. The integration work is all done by the OEM before they turn around and sell the system to you. Any external dependencies aren’t necessarily guaranteed, and often need to be tuned and configured to work properly in a customer environment for the best experience. Those external systems often undergo their OWN integration testing. SANs, networks, and other dependent infrastructure are tested and sold in their own silos to work under specific conditions.

The world of HCI blurs those lines dramatically. All of those different parts now converge into the same set of systems and need to work quickly and flawlessly with each other. The server is now the storage and the network and the compute, all in one. All of those separate integration steps need to be considered as part of the system design, not just the ones for the server itself. This is where “whole-solution validation” comes in. I’m going to dive into one of the thorniest and most problematic areas, the storage subsystem.

Off-the-shelf vs vendor supplied

One of the simplest areas to overlook is the differences between retail or channel supplied parts, and their vendor-tuned equivalents. One of the lesser-discussed features of the storage world is how specialized different versions of otherwise identical devices can become, as requested by different vendors. This isn’t anything specific to an OS or an OEM, it’s industry-wide.

Let’s say vendor A has a very popular model of disk. They’ll take that one disk and sell it into the retail channel under their enterprise label. The firmware will be tuned for maximum compatibility, since they don’t know what it will be attached to, and what features are needed or supported. It won’t use aggressive timings. It won’t implement special commands. It won’t do anything fancy that might break in a general configuration. Now, they take that same disk, and sell it to one of their major OEM customers. That OEM has LOTS of requirements. They sell their own SANs. They sell servers. They sell hardened systems that get installed on oil rigs. They sell all kinds of things that one disk might need to be plugged into. All of those uses might have their own special needs. They have specialized diagnostics. They have aggressive timings, which allows them to have better performance and latencies than their competitors. They only have 1 or 2 types of HBA they support, so their testing matrix is small. The disk vendor will turn around and sell that device to them with a specialized firmware load that meets all their needs. Now repeat that procedure for a totally different OEM with very different requirements. You now have 3 versions of the same disk, all slightly different. Repeat another few times…you see where this is going, right?

Now, for all intents and purposes those disks look and act like any other disk 99% of the time. After all, there are industry standards around these things, and you’re not going to get many sales for a SATA or SAS disk that won’t ACT like a SATA or SAS disk when addressed by the system. HOWEVER, the 1% of the time it’s slightly different is often when it matters most. When the system is under stress, or when certain commands are issued in a certain order, it can result in behaviors that no one ever explicitly tested for before. OEM A tested for feature X in this scenario. OEM B didn’t test for it at all because they don’t use it that way. Put it into System C, and maybe the firmware on the disk just checks out and never comes back. Now you have a “dead” disk that’s simply an interop bug that no one but YOU would have found, because you’re building from a parts list instead of from a solution list.

You can take the above, very real example, and apply it across the entire storage chain. You’ll quickly discover that a real compatibility matrix for a small set of physically identical devices multiplies into HUNDREDS of unique configurations. That unfortunate reality is why, even though there is a healthy list of parts with the required Software-Defined Data Center AQs, we want customers invested in solutions that were designed and tested from end to end by our partners and OEMs, rather than just a list of certified devices.
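
One quick way to gauge how much of this variation has already crept into a cluster you own is to inventory the drive model and firmware revision on every node and look for mismatches. The snippet below is a rough, illustrative sketch using placeholder node names; it is not a substitute for solution-level testing.

    # Collect model, firmware revision, media type, and bus type for every physical disk on each node
    $disks = Invoke-Command -ComputerName Node01, Node02, Node03, Node04 -ScriptBlock {
        Get-PhysicalDisk | Select-Object FriendlyName, Model, FirmwareVersion, MediaType, BusType
    }

    # Group by model and firmware; more than one firmware revision per model is worth investigating
    $disks | Group-Object Model, FirmwareVersion | Sort-Object Name | Format-Table Count, Name -AutoSize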

Architectural differences matter too

Now, your first thought after reading that might be, “Fine, I’ll simply use SDDC parts from my preferred OEM, knowing that they’ve been designed to work together.” While that could eliminate certain types of interop bugs, it still leaves the chance for architectural changes that were never tested together as part of an integrated solution design. As an example, mixing SAS and SATA devices on the same storage bus could result in unexpected issues during failures or certain types of I/O. While technically supported by most HBAs and vendors, unless you’ve actually tested the solution that way, you have an incomplete picture of how all the pieces will work together. Another example is that not all SSDs are created equal. NVMe devices offer a tremendous speed and latency benefit over more traditional SAS and SATA SSDs. With that boost can come a much higher performance ceiling for the entire system, which can result in heavy and unusual I/O patterns for the other storage devices on the system. One configuration using SATA SSDs and HDDs may behave very differently the second you swap the SATA SSDs for NVMe ones, despite the fact that all of them may be SDDC certified by one vendor.

This isn’t exclusively a storage problem, either. There are often multiple NICs in an OEM’s catalog that can support high-performance RDMA networks. Some of them will use RoCE, some will use iWARP, likely none of them will be supported by the vendor in a mixed environment, and often they require very specific firmware revisions for different whole solution configurations to ensure maximum reliability and performance. If no one ever tested the whole system as a solution with all the pieces well-defined, from end-to-end, the final reliability of the solution can only be speculated on.
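
The same style of audit applies to these architectural mixes. Before committing to a configuration, it is worth confirming exactly which bus types, media types, and RDMA-capable adapters are actually in play on each node. A rough sketch, again with placeholder node names:

    # Summarize the mix of bus types and media types behind the storage pool on each node
    Invoke-Command -ComputerName Node01, Node02, Node03, Node04 -ScriptBlock {
        Get-PhysicalDisk | Group-Object BusType, MediaType | Select-Object Count, Name
    } | Format-Table PSComputerName, Count, Name -AutoSize

    # List the enabled RDMA adapters; mixing RoCE and iWARP in one solution is typically unsupported
    Invoke-Command -ComputerName Node01, Node02, Node03, Node04 -ScriptBlock {
        Get-NetAdapterRdma | Where-Object Enabled | Select-Object Name, InterfaceDescription
    } | Format-Table PSComputerName, Name, InterfaceDescription -AutoSize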

Conclusion

These posts aren’t meant to make blanket statements about the supportability of do-it-yourself HCI and S2D configurations. Assuming your systems have all the proper logos and certifications, and your configuration passes cluster validation and other supportability checks, Microsoft will support you. However, it’s very easy to get caught with a slightly different version of the hardware, firmware, tested drivers, and other components. Building and tracking this in-house is not a trivial task! That’s why we wanted to make it clear why we’re running the Windows Server Software-Defined (WSSD) program, and what benefits you can expect by purchasing one of these configurations for your critical workloads. We feel confident that you will have a better HCI experience with Windows Server over the lifetime of the solution via this program than by building it on your own.
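
As Crincoli notes, cluster validation is the baseline supportability check, and it can be re-run at any time against a live cluster. A brief sketch, using a placeholder cluster name:

    # Re-run validation against an existing cluster; keep the generated report for any support case
    Test-Cluster -Cluster S2DCluster -Include "Storage Spaces Direct", "Inventory", "Network", "System Configuration"

    # Check the overall health of the software-defined storage stack
    Get-StorageSubSystem Cluster* | Get-StorageHealthReport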

It’s clear that Microsoft is ready for prime time with hyper-converged infrastructure, Storage Spaces Direct, Windows Server 2016, and its revolutionary vision of the Windows Server Software-Defined data center.

If you have any questions about Storage Spaces Direct or DataON’s WSSD-certified solutions, give us a shout at DataON_Sales@DataONStorage.com.
