Seeing Through the Smoke and Mirrors of the Hyper-V/QLogic Storage “Benchmark”


Last Wednesday QLogic announced what appeared to be a very impressive benchmark: “QLogic Achieves Near-Native Fibre Channel I/O Performance On Windows Server 2008 Hyper-V.” By near-native performance, QLogic meant throughput of nearly 200,000 IOPS. Naturally, such high throughput in a virtualized environment caught my attention. The announcement was timed to coincide with the Hyper-V RTM announcement and to immediately validate the storage I/O performance of Hyper-V connected to SAN storage through QLogic 8Gb fibre channel host bus adapters (HBAs). I’ve always liked benchmarks when they set relative expectations for how a particular configuration will perform in a typical environment. When the environment is far from typical, I consider the benchmark either an academic exercise (let’s see how far we can push this thing, regardless of how unrealistic the configuration may be) or a crafty attempt at product marketing. If I were to place this particular benchmark into one of Nik Simpson’s benchmarking categories, I’d have to say it falls into the benchmarketing category.

The QLogic press release included the following quote from Microsoft’s Mike Schultz:

QLogic’s benchmark result surpasses the existing benchmark results in the market, and demonstrates that Windows Server 2008 Hyper-V customers can achieve higher server utilization rates and consolidate servers with great technical performance.

The statement “surpasses the existing benchmark results in the market” implies that the Hyper-V/QLogic benchmark has outperformed a comparable VMware benchmark. The press release was careful to name the hypervisor and the fibre channel HBA (QLogic 2500 Series 8Gb adapter), but failed to mention the back-end storage configuration. I consider this an important omission. After some digging around, I was able to find the benchmark results here. If I were watching an Olympic event, this would be the moment when, after thinking I had witnessed an incredible athletic feat, I learned that the athlete had tested positive for steroids. Microsoft and QLogic didn’t take a fibre channel disk array and inject it with stanozolol or rub it with “the clear,” but they did use solid state storage: a Texas Memory RamSan 325 FC storage array.

The benchmark that resulted in nearly 200,000 IOPS, as you’ll see from the diagram, ran at about 90% of native performance (180,000 IOPS). However, this benchmark used a completely unrealistic block size of 512 bytes (a block size of 8K or 16K would have been more realistic). The benchmark that came closest to native throughput (a 3% performance delta) yielded 120,426 IOPS with an 8KB block size. No other virtualization vendors have published benchmarks using solid state storage, so the QLogic/Hyper-V benchmark, to me, really hasn’t proven anything. Furthermore, the published benchmark fails to reveal latency numbers, which have been the most useful measure of storage performance in virtualized environments. Applications can be very sensitive to I/O latency, and it’s important to disclose latency numbers in any storage benchmark.
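
To put the block size complaint in perspective, here is a rough sketch that converts the two IOPS figures quoted above into data rates. The helper name `bandwidth_mb_per_s` is my own, and I am assuming that 8KB means 8,192 bytes and that the 180,000 IOPS figure belongs to the 512-byte run; neither detail is spelled out in the press release.

```python
# Hypothetical helper (not from the benchmark report): convert an IOPS
# figure and a block size into a data rate.
def bandwidth_mb_per_s(iops, block_size_bytes):
    """Return the data rate in decimal megabytes per second."""
    return iops * block_size_bytes / 1_000_000

# Figures as quoted in the post; the 8KB = 8,192 bytes mapping is an assumption.
runs = {
    "512-byte blocks": (180_000, 512),
    "8KB blocks": (120_426, 8 * 1024),
}

for label, (iops, block_size) in runs.items():
    rate = bandwidth_mb_per_s(iops, block_size)
    print(f"{label}: {iops:,} IOPS = {rate:,.1f} MB/s")
```

Under those assumptions, the 512-byte run moves roughly 92 MB/s, less than a tenth of the data moved by the 8KB run, despite posting the bigger IOPS number.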

For further clarity, I ran these results by a colleague well-versed in performance testing and this was his response:

In a storage stack, the number of concurrent I/Os is typically limited at certain choke points, i.e., the virtual device, the queue between the guest and the parent OS, and the drivers in the parent. The recent Microsoft benchmark used an I/O depth of just 64, but with an SSD the latency is very small: at 0.3ms per I/O, it’s possible in theory to generate about 210,000 IOPS with 64 outstanding I/Os.

However, to properly demonstrate 180,000 real IOPS on spinning disks, at roughly 7ms per I/O, would require more than 1,200 concurrent I/Os, rather than the 64 used.

With real disks, the same 64 concurrent I/Os at 7ms each would limit throughput to 64 * 1/.007 = 9,142 IOPS!
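
The arithmetic in that response is an application of Little's Law: sustained throughput equals the number of outstanding I/Os divided by the latency of each one. Here is a minimal sketch of the calculation. The function names are mine, and the 0.3ms and 7ms latencies are the illustrative values from the quote above, not measured numbers from the benchmark.

```python
# Little's Law for storage: outstanding I/Os = IOPS * latency.
# Function names are hypothetical; latencies are the quote's illustrative values.

def max_iops(outstanding_ios, latency_seconds):
    """Upper bound on IOPS for a given queue depth and per-I/O latency."""
    return outstanding_ios / latency_seconds

def required_outstanding(target_iops, latency_seconds):
    """Queue depth needed to sustain target_iops at the given latency."""
    return target_iops * latency_seconds

SSD_LATENCY = 0.0003   # 0.3ms per I/O
DISK_LATENCY = 0.007   # 7ms per I/O
QUEUE_DEPTH = 64

print(f"SSD, depth 64:  {max_iops(QUEUE_DEPTH, SSD_LATENCY):,.0f} IOPS")
print(f"Disk, depth 64: {max_iops(QUEUE_DEPTH, DISK_LATENCY):,.0f} IOPS")
print(f"Depth for 180,000 IOPS on disk: "
      f"{required_outstanding(180_000, DISK_LATENCY):,.0f}")
```

With these inputs the SSD ceiling comes out at about 213,000 IOPS, the spinning-disk ceiling at a little over 9,100 IOPS, and the queue depth needed for 180,000 IOPS on disks at about 1,260, which lines up with the rounded figures in the quote.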

To me, these exercises in smoke-and-mirrors trickery (i.e., solid state storage in a hypervisor storage performance “benchmark”) raise more questions than they answer. In addition, I’m left questioning future benchmarks from vendors that use such tactics. Vendors - if you are going to go as far as issuing a press release based on a “benchmark,” please give us an honest assessment of a real-world environment. Anything else simply casts doubt on your future performance numbers and adds to the already prolific cynicism surrounding vendor benchmarks.

  1. #1 by Scott Lowe - June 30th, 2008 at 11:14

    Great catch, Chris. I, too, share your request for honesty and completeness in disclosing vendor benchmark environments. At least then the readers of the benchmark can determine for themselves the validity of the benchmark and/or its applicability to their own environments.

  2. #2 by Chris - June 30th, 2008 at 15:24

    Thanks, Scott. I agree. At the end of the day, if you’re going to issue a press release about a benchmark, at least do so with full disclosure and provide the test case to back up your findings. If that had been the case, I think my reaction would have been different.

  3. #3 by Harald Ums - July 1st, 2008 at 06:44

    I do not share your criticism:
    If you want to demonstrate the speed of your devices, then you avoid any other bottlenecks, so you use a RamSan.
    If you want to show that transitions to and from your VM do not matter, then you use a block size that causes a lot of transitions: 512 bytes.

    This benchmark is there to test VM overhead, not physical disk overhead.

  4. #4 by Chris - July 1st, 2008 at 07:15

    I understand your points, Harald. However, part of my job with Burton Group is customer advocacy. What got me started down the path to blogging about this benchmark was the fact that it was issued through a press release without the test case released to back it up. And I’m not just picking on Microsoft here. For well over a year, I had been critical of VMware benchmarks and the fact that they were not showing scalability tests. Too many vendors provide benchmark results that involve running a single VM on a single physical host (I’m assuming that’s the case with the Microsoft/QLogic benchmark). I don’t think you’ll find a VMware benchmark published in the last couple of months that does not include scalability results. If you want to prove the performance of the hypervisor, you have to do so under a real workload. Benchmarking the performance of 1 VM on 1 host does not accurately reflect the scheduling work that the hypervisor needs to do, so to me this is not a true reflection of VM/hypervisor performance. Show me scalability up to 8 VMs and I’m a believer, since consolidation ratios of 8:1 to 12:1 have been pretty typical. When I see benchmarks that are completely devoid of any type of real-world workload, I’m going to bring attention to them. I may take a little heat for it, but that’s OK.

  5. #5 by rick - July 1st, 2008 at 22:38

    Also take a look at the recently published VMware benchmarks of pushing over 100,000 IOPS from one host: http://blogs.vmware.com/performance/2008/05/100000-io-opera.html

    I believe they were using a traditional SAN with actual spinning disks in this case!

  6. #6 by Tim - July 7th, 2008 at 16:16

    I think you are off on this one Chris…

    Road and Track readers don’t want to know how fast a Ferrari can go from zero to 60 through a populated city block with stop signs and traffic signals. Readers want to know the pure power of the Ferrari on a controlled track. I believe this is what Microsoft has done for us.

    Furthermore, you’re touting real-world scenarios, yet you failed to mention the additional data available from the link within your blog. This link contains additional information on workload and application or “real world” data.

  7. #7 by Chris - July 13th, 2008 at 08:58

    Good points, Tim. I fully understand the practice. The benchmark methodology has been in place for several years and makes perfect sense on a physical server. So to your point on solid state storage, I agree. However, to claim that you’re faster than any previously published benchmark when no previously published benchmark used solid state storage is a bit much. To me, that’s a significant omission. Also, experience has shown that any benchmark with a single VM on a single host does not reflect a true scheduling load on the hypervisor and thus is not accurate. Of course, I can only assume that no scalability tests were conducted. I had asked QLogic for a copy of the full test case and thus far have yet to receive it.

  8. #8 by James O’Neill - July 17th, 2008 at 08:46

    Chris, I find myself agreeing and disagreeing with you at the same time.

    The purpose of this benchmark is to prove - if it can be proved - that Hyper-V is not an I/O bottleneck. I read the numbers and said “What the hell kind of system can do 200,000 IOPS?” It was plainly not the kind of system which is going to be installed in many environments. It allows Microsoft people to shout “B.S.” at the top of their lungs if anyone from VMware claims to have drivers which are much better than Windows ones. It also kills any suggestion that Hyper-V and Windows drivers are OK in small systems but don’t scale.

    You’re right that if a Microsoft benchmark says “runs at 90% of the speed of RAW hardware” the intelligent question to ask is “is that better, worse or about the same as the competition?” Is it “Faster than any previous benchmark on virtualization” because it got a better percentage of the hardware or because it kept on scaling when the hardware improved? Either would be a win for Microsoft. Just saying “ya boo sucks … we’re faster than you” isn’t.

  9. #9 by Chris - July 17th, 2008 at 09:59

    Great comments, James. I completely agree. All that is missing, in my opinion, is a good set of storage I/O scalability results. Show me the throughput/latency with one VM per host, then do the same with four and eight VMs per host. To me, that would tell the full story and give Microsoft every right to shout “B.S.” at competitive claims about inferior performance. Once SPEC wraps up its port of VMmark as the new SPEC Virtualization industry standard benchmark, hypervisor performance comparisons should be much easier.

  10. #10 by John - July 22nd, 2008 at 12:04

    Rick,

    The VMware test also used 3 EMC CX3-80s, which would have taken up nearly 3 full racks. The RamSan 325c takes up 3U.
