–This has been revised based on comments I’ve received since the original posting; check the comment thread if you’re interested in what changed and why–
I came in this morning with an unusually clear diary, and took the liberty to check the newsfeeds for NetApp and EMC, this is when I came across an EMC press release entitled “EMC VNX SETS PERFORMANCE DENSITY RECORD WITH LUSTRE —SHOWCASES “NO COMPROMISE” HPC STORAGE“.
I’ve been doing some research on Lustre and HPC recently, and that claim surprised me more than a little, so I checked it out; maybe there’s a VNX sweet spot for HPC that I wasn’t expecting. The one thing that stood out straight away was the statement that the “EMC® VNX7500 has set a performance density record with Lustre” by “delivering read performance density of 2GB/sec per rack” (highlight mine).
In the first revision of this I had some fun pointing out the lameness of that particular figure (e.g. “From my perspective, measured on a GB/sec per rack, 2GB/sec/rack is pretty lackluster”), but EMC aren’t stupid (or at least their engineers aren’t, though I’m not so sure about their PR agency at this point). It seems that EMC’s PR people didn’t actually listen to what the engineers were saying and left out a small but important word, and that word is “unit”. This becomes apparent if you look at the other material in that press release: “8 GB/s read and 5.3 GB/s write sustained performance, as measured by XDD benchmark performed on a 4U dual storage processor”. Dividing 8 GB/s by 4U gives 2GB/sec per rack unit, which actually sounds kind of impressive.
So let’s dig a little deeper. What we’ve got is a 4U dual storage processor that gets some very good raw throughput numbers, about 1.5x the throughput (roughly 50% faster) on a “per controller” basis compared with the figures used in the E5400 press release I referenced earlier, so on that basis I think EMC has done a good job. But this is where the PR department starts stretching the truth again by leaving out some fairly crucial pieces of information. Notably, the 4U controller behind the 2GB/sec/rack unit figure is a 2U VNX7500SPE plus a 2U standby power supply, which is required when the 60 drive dense shelves are used exclusively (which is the case for the VNX Lustre Proof of Concept information shown in their brochures), and this configuration doesn’t include any of the rack units required for the actual storage. Either that, or it’s a 2U VNX7500SPE with a 2U shelf and no standby power supply, which seems to be a mandatory component of a VNX solution, and I can’t quite believe that EMC would do that.
If we compare the VNX to the E5400, you’ll notice that the VNX controllers and standby power supplies alone consume 4U of rack space without adding any capacity, whereas the E5400 controllers are much, much smaller and fit directly into a 2U or 4U disk shelf (or DAE in EMC terminology), which means a 4U E5400 based solution is something you can actually use, as the required disk capacity is already there in the 4U enclosure.
Let’s go through some worked calculations to show how this works. In order to add capacity in the densest possible EMC configuration, you’d need to add an additional 4RU shelf with 60 drives in it. Net result: 8RU, 60 drives, and up to 8 GB/s read and 5.3 GB/s write (the press release doesn’t make it clear whether a VNX7500 can actually drive that much performance from only 60 drives; my suspicion is that it cannot, otherwise we would have seen something like that in the benchmark). Measured on a GB/s per RU basis this ends up as only 1 GB/sec per Rack Unit, not the 2 GB/sec per Rack Unit which I believe was the point of the “record setting” configuration. And just for kicks, as you add more storage to the solution that number goes down, as shown by the “dual VNX7500/single rack solution that can deliver up to 16GB/s sustained read performance”, which works out to about 0.4 GB/sec per Rack Unit. Using the configurations mentioned in EMC’s proof of concept you end up with around 0.666 GB/sec per Rack Unit. All of these are a lot less than the 2 GB/sec/RU claimed in the press release.
If you wanted the highest performing configuration using a “DenseStak” solution within those 8RU with an E5400 based Lustre solution, you’d put in another E5400 unit with an additional 60 drives. Net result: 8RU, 120 drives, and 10 GB/sec read and 7 GB/sec write (and yes, we can prove that we can get this kind of performance from 120 drives). Measured on a GB/s per RU basis this ends up as 1.25 GB/sec per Rack Unit. That’s good, but it’s still not the magic number mentioned in the EMC press release. However, if you were to use a “FastStak” solution, those numbers would pretty much double (thanks to using 2RU disk shelves instead of 4RU disk shelves), which would give you a controller performance density of around 2.5 GB/sec per Rack Unit.
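For anyone who wants to check the arithmetic in the worked calculations above, here is a minimal Python sketch. The throughput and rack unit figures are the ones quoted in the text; two inputs are my assumptions and are marked as such in the comments: that the dual VNX7500 single rack is a 40U rack (which is consistent with the roughly 0.4 figure), and that one FastStak building block sustains the same 5 GB/s read as one DenseStak unit but in 2RU. The 0.666 proof of concept figure is not reproduced because its rack unit count isn’t stated here.

```python
def gb_per_rack_unit(read_gb_per_sec: float, rack_units: int) -> float:
    """Sustained read throughput divided by the rack units consumed."""
    return read_gb_per_sec / rack_units


# (label, sustained read GB/s, rack units)
configs = [
    ("VNX7500 controllers + SPS, no disks", 8.0, 4),
    ("VNX7500 + one 60 drive shelf", 8.0, 8),
    # Assumption: the single rack in the press release is a 40U rack.
    ("Dual VNX7500, full rack", 16.0, 40),
    ("2 x E5400 DenseStak, 120 drives", 10.0, 8),
    # Assumption: one FastStak block delivers 5 GB/s read in 2RU.
    ("1 x E5400 FastStak building block", 5.0, 2),
]

density = {label: gb_per_rack_unit(gbps, ru) for label, gbps, ru in configs}

for label, value in density.items():
    print(f"{label:40s} {value:5.2f} GB/s per RU")
```

The only configuration that reaches 2 GB/sec per rack unit on the VNX side is the one with no disks in it, which is the point of the comparison.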
Bottom line: for actual usable configurations, a NetApp solution has much better performance density using the same measurements EMC used for their so-called “Record Setting” benchmark result.
In case you think I’m making these numbers up, they are confirmed in the NetApp whitepaper wp-7142, which says:
The FastStak reference configuration uses the NetApp E5400 scalable
storage system as a building block. The NetApp E5400 system is designed
to support up to 24 2.5-inch SAS drives, in a 2U form factor.
Up to 20 of these building blocks can be contained in an industry-standard
40U rack. A fully loaded rack delivers performance of up to 100GB/sec
sustained disk read throughput, 70GB/sec sustained disk write throughput,
and 1,500,000 sustained IOPS.
According to IDC, the average supercomputer produces 44GB/sec,
so a single FastStak rack is more than fast enough to meet the I/O
throughput needs of many installations.
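As a rough cross-check of the whitepaper figures, the sketch below breaks the full rack numbers down per building block and per rack unit. It assumes performance scales linearly with the number of building blocks, which the “up to” wording in the whitepaper does not guarantee, so treat the results as upper bounds rather than measurements.

```python
import math

# Figures quoted from the whitepaper excerpt above.
BLOCKS_PER_RACK = 20           # 2U building blocks in a 40U rack
RACK_UNITS = 40
RACK_READ_GBPS = 100.0
RACK_WRITE_GBPS = 70.0
RACK_IOPS = 1_500_000
AVG_SUPERCOMPUTER_GBPS = 44.0  # IDC average cited in the whitepaper

# Assumption: linear scaling, so each block contributes an equal share.
per_block_read = RACK_READ_GBPS / BLOCKS_PER_RACK
per_block_write = RACK_WRITE_GBPS / BLOCKS_PER_RACK
per_block_iops = RACK_IOPS // BLOCKS_PER_RACK
read_per_rack_unit = RACK_READ_GBPS / RACK_UNITS

# Building blocks needed to cover the IDC average read rate.
blocks_for_average = math.ceil(AVG_SUPERCOMPUTER_GBPS / per_block_read)

print(f"per block: {per_block_read} GB/s read, {per_block_write} GB/s write, "
      f"{per_block_iops} IOPS")
print(f"read density: {read_per_rack_unit} GB/s per RU")
print(f"blocks needed for {AVG_SUPERCOMPUTER_GBPS} GB/s: {blocks_for_average}")
```

Under that linear scaling assumption, a little under half a rack would cover the IDC average, and the per rack unit figure matches the FastStak density mentioned earlier.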
While I’ll grant that this result is achieved with more hardware, it should be remembered that the key to good HPC performance is in part the ability to efficiently throw hardware at a problem. From a storage point of view this means having the ability to scale performance with capacity. In this area the DenseStak and FastStak solutions are brilliantly matched to the requirements of, and the prevailing technology used in, High Performance Computing. Rather than measuring GB/sec per rack unit, I think a better measure would be “additional sequential performance per additional gigabyte”. Measured on a full rack basis, the NetApp E5400 based solution ends up at around 27MB/sec/GB for the DenseStak, or 54MB/sec/GB for the FastStak. In comparison, the fastest EMC solution as referenced in the “record setting” press release comes in at about 10MB/sec of performance for every GB of provisioned capacity, or about 22MB/sec/GB for the configuration in the proof of concept brochure. Any way you slice this, the VNX just doesn’t end up looking like a particularly viable or competitive option.
The key here is that Lustre is designed as a scale out architecture. The E5400 solution is built as a scale out solution by using Lustre to aggregate the performance of multiple carefully matched E5400 controllers, whereas the VNX7500 used in the press release is a relatively poorly matched scale-up configuration which is being shoe-horned into a use case it wasn’t designed for.
In terms of performance per rack unit, or performance per GB, there simply isn’t much out there that comes close to an E5400 based Lustre solution, certainly not from EMC, as even Isilon, their best Big Data offering, falls way behind. The only other significant questions that remain are how much they cost to buy, and how much power they consume.
I’ve seen the pricing for EMC’s top of the range VNX 7500, and it’s not cheap, it’s not even a little bit cheap, and the ultra-dense stuff shown in the proof of concept documents is even more not cheap than their normal stuff. Now I’m not at liberty to discuss our pricing strategy in any detail on this blog, but I can say that in terms of “bang per buck”, the E5400 solutions are very, very competitive, and the power impact of the E5400 controller inside a 60 drive dense shelf is pretty much negligible. I don’t have the specs for the power draw of a VNX7500 and its associated external power units, but I’m guessing it adds around as much as a shelf of disks, the power costs of which add up over the three year lifecycle typically seen in these kinds of environments.
From my perspective the VNX7500 is a good general purpose box, and EMC’s engineers have every right to be proud of the work they’ve done on it, but positioning this as a “record setting” controller for performance dense HPC workloads on Lustre is stretching the truth just a little too far for my liking. While the 10GB/sec/rack mentioned in the press release might sound like a lot for those of us who’ve spent our lives around transaction processing systems, for HPC, 10GB/sec/rack simply doesn’t cut it. I know this, the HPC community knows this, and I suspect most of the reputable HPC focussed engineers in EMC also know this.
It’s a pity though that EMC’s PR department is spinning this for all they’re worth; I struggle with how they can possibly assert that they’ve set any kind of performance density record for any kind of realistic Lustre implementation, when the truth is that they are so very, very far behind. Maybe their PR dept has been reading 1984, because claiming record setting performance in this context requires some of the most bizarre Orwellian doublespeak I’ve seen in some time.
So today I found out that we’d broken a few records of our own a few days ago, which was, at least from my perspective, associated with surprisingly little fanfare, with the associated press release coming out late last night. I’d like to say that the results speak for themselves, and to an extent they do. NetApp now holds the top two spots, and four out of the top five results on the ranking ladder. If this were the Olympics most people would agree that this represents a position of PURE DOMINATION. High fives all round, and much chest beating and downing of well deserved delicious amber beverages.
So, apart from having the biggest number (which is nice), what did we prove?
Benchmarks are interesting to me because they are the almost perfect intersection of my interests in both technical storage performance and marketing and messaging. From a technical viewpoint, a benchmark can be really useful, but it only provides a relatively small number of proof points, and extrapolating beyond those, or making generalised conclusions is rarely a good idea.
For example, when NetApp released their SPC-1 benchmarks a few years ago, it proved a number of things:
1. That under heavy load which involved a large number of random writes, a NetApp array’s performance remained steady over time
2. That this could be done while taking multiple snapshots, and more importantly while deleting and retiring them under heavy load
3. That this could be done with RAID-6 and with a greater capacity efficiency, as measured by RAW vs USED, than any other submission
4. That this could be done at better levels of performance than an equivalently configured, commonly used “traditional array” as exemplified by EMC’s CX3-40
5. That the copy on write performance of the snapshots on an EMC array sucked under heavy load (and by implication so would similar copy on write snapshot implementations on other vendors’ arrays)
That’s a pretty good list of things to prove, especially in the face of considerable unfounded misinformation being put out at the time, which, surprisingly, is still bandied about despite the independently audited proof to the contrary. Having said that, this was not a “my number is the biggest” exercise, which generally proves nothing more than how much hardware you had available in your testing lab at the time.
A few months later we published another SPC-1 result which showed that we could pretty much double the numbers we’d achieved in the previous generation at a lower price per IOPS, with what was at the time a very competitive submission.
About two years after that we published yet another SPC-1 result with the direct replacement for the controller used in the previous test (3270 vs 3170). What this test didn’t do was show how much more load could be placed on the system; what it did do was show that we could give our customers more IOPS at a lower latency with half the number of spindles. This was the first time we’d submitted an SPC-1e result, which focussed on energy efficiency. It showed, quite dramatically, how effective our FlashCache technology was under a heavy random write workload. It’s interesting to compare that submission with the previous one for a number of reasons, but for the most part, this benchmark was about FlashCache effectiveness.
We did a number of other benchmarks, including SPECsfs benchmarks, that also proved the remarkable effectiveness of the FlashCache technology, showing how it could make SATA drives perform better than Fibre Channel drives, or dramatically reduce the number of Fibre Channel drives required to service a given workload. There were a couple of other benchmarks which I’ll grant were “hey look at how fast our shiny new boxes can run”, but for the most part, these were all done with configurations we’d reasonably expect a decent number of our customers to actually buy (no all SSD configurations).
In the meantime EMC released some “Lab Queen” benchmarks. At first I thought that EMC were trying to prove just how fast their new X-Blades were for processing CIFS and NFS traffic. They did this by configuring the back end storage system in such a ridiculously overengineered way as to remove any possibility that it could cause a bottleneck; either that, or EMC’s block storage devices are way slower than most people would assume. From an engineering perspective I think the guys in Hopkinton who created those X-Blades did a truly excellent job; almost 125,000 IOPS per X-Blade using 6 CPU cores is genuinely impressive to me, even if all they were doing was processing NFS/CIFS calls. You see, unlike the storage processors in a FAS or Isilon array, the X-Blade, much like the Network Processor in a SONAS system or an Oceanspace N8500, relies on a back end block processing device to handle RAID, block checksums, write cache coherency and physical data movement to and from the disks, all of which is non-trivial work. What I find particularly interesting is that in all the benchmarks I looked at for these kinds of systems, the number of back end block storage systems was usually double that of the front end, which implies to me either that the load placed on the back end systems by these benchmarks is higher than the load on the front end, or, more likely, that the front end / back end architecture is very sensitive to any latency on the back end systems, which means the back end systems get overengineered for benchmarks. My guess, after seeing the “All Flash DMX” configuration, is that Celerra’s performance is very adversely affected by even slight increases in latency in the back end, and that we start seeing some nasty manifestations of Little’s Law in these architectures under heavy load.
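To make the Little’s Law point concrete, here is a small sketch of why a front end with a fixed number of operations in flight is so sensitive to back end latency. Little’s Law says the number of operations in flight equals throughput multiplied by latency. The 125,000 IOPS figure is the per X-Blade number quoted above; the 2 ms baseline latency and the added latencies are purely hypothetical values I’ve picked for illustration, not measurements of any EMC system.

```python
def outstanding_ops(iops: float, latency_us: float) -> float:
    """Little's Law: ops in flight = throughput x latency."""
    return iops * latency_us / 1_000_000


def achievable_iops(in_flight: float, latency_us: float) -> float:
    """Little's Law rearranged: throughput = ops in flight / latency."""
    return in_flight * 1_000_000 / latency_us


XBLADE_IOPS = 125_000    # per X-Blade figure quoted in the text
BASE_LATENCY_US = 2_000  # hypothetical 2 ms end to end latency

# Ops that must be in flight to sustain the quoted rate at the baseline.
in_flight = outstanding_ops(XBLADE_IOPS, BASE_LATENCY_US)

# Hold the ops in flight constant and add hypothetical back end latency.
for extra_us in (0, 250, 500, 1_000):
    latency_us = BASE_LATENCY_US + extra_us
    iops = achievable_iops(in_flight, latency_us)
    print(f"+{extra_us:>5} us back end latency -> {iops:,.0f} IOPS")
```

With the concurrency held constant, throughput falls in direct proportion to the added latency, which is one plausible reason to overbuild the back end for a benchmark run.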
A little while later, after being present at a couple of EMC presentations (one at Cisco Live, the other at a SNIA event, where EMC staff were fully aware of my presence), it became clear to me exactly why EMC did these “my number is bigger than yours” benchmarks. Their marketing staff at corporate created a slide that compared all of the current SPC benchmarks in a way that was accurate, compelling and completely misleading all at the same time, at least as far as the VNX portion goes. Part of this goes back to the way that vendors, including, I might say, NetApp, use an availability group as a point of aggregation when reporting performance numbers. This is reasonably fair, as adding Active/Active or Active/Passive availability generally slows things down due to the two phase commit nature of write caching in modular storage environments. However, the configuration of the EMC VNX VG8 Gateway/EMC VNX5700 actually involves 5 separate availability groups (1x VG8 Gateway system with 4+1 redundancy, and 4x VNX5700 with 1+1 redundancy). Presenting this as one aggregated performance number without any valid point of aggregation smacks of downright dishonesty to me. If NetApp had done the same thing, then, using only 4 availability groups, we could have claimed over 760,000 IOPS by combining 4 of our existing 6240 configurations, but we didn’t, because frankly doing that is, in my opinion, on the other side of the fine line where marketing finesse falls off the precipice into the shadowy realm of deceptive practice.
Which brings me back to my original question: what did we prove with our most recent submissions? Well, three things come to mind:
1. That NetApp’s ONTAP 8.1 Cluster-Mode solution is real, and it performs brilliantly
2. That it scales linearly as you add nodes (more so than the leading competitors)
3. That scaling with 24 big nodes gives you better performance and better efficiency than scaling with hundreds of smaller nodes (at least for the SPEC benchmark)
This is a valid configuration using a single vserver as a point of aggregation across the cluster, and trust me, this is only the beginning.
As always, comments and criticism are welcome.
I’m in the middle of digesting what was actually released in EMC’s recent launch. For the most part there isn’t anything really that new: lots of unsupported hype like, “3 times simpler, 3 times faster.” Faster than what, exactly? From a technical perspective the only thing that’s really interesting or surprising is the VNXe and that was less interesting than I expected because I thought they were going to refresh their entire range using that technology. So it looks like they’ve given up trying to make that scale for the moment.
So much of what they’ve done copies or validates what we’ve already done at NetApp:
- Simplified software packaging
- Launching a lot of stuff at the same time
- New denser shelves with small form-factor drives
- An emphasis on storage efficiency
- An emphasis on flash as a caching layer
- The ideal match between unified storage and virtualized environments
The biggest change that I see is that they now appear to be shipping all their controllers with unified capability from the start, enabled via a software upgrade which is something EMC has criticised us for in the past. Now they acknowledge that the only way to compete with NetApp effectively is to try to be as much like us as they possibly can. This might explain why EMC in Australia isn’t going to sell the “Block only” VNX 5100. SearchStorage.com.au had this report:
EMC’s new VNX 5100, a block-only storage device, won’t go on sale in Australia because “We did not see great enough demand to see that particular system,” according to Mark Oakey, the company’s Marketing Manager for Storage Platforms in Australia and New Zealand. “We’ll continue with the Clariion CX4 120,” he told SearchStorage ANZ. “It has more or less the same capabilities.”
Most of the interesting capabilities they’re touting came last year with FLARE 30 and DART 6.0 (two of their operating systems). Even the VMax stuff they’re pushing during the launch came out via a software upgrade without a lot of fanfare in December, so as far as I can see their “record breaking announcement” consists of announcing a whole bunch of things they’d already done along with some new tin.
Things they didn’t announce:
- Multistore equivalency
- V-Series equivalency
- Unified replication capabilities
- A commercial grade VMware based “Virtual Storage Array” (the new low end box is based on Linux)
- A scale out roadmap for their “Unified” platform
- Any significant change in their management software strategy or offering
- Block level deduplication for their unified arrays
- Clarification on where their newly acquired scale out Isilon systems fit within their new “Unified” ecosystem.
Overall EMC did a catch up release to try and maintain pace with NetApp innovation, and nothing they’ve done or released represents a significant new threat. If this is
“the most significant midrange announcement in EMC’s 30-year history”
according to Rich Napolitano, President, Unified Storage Division at EMC, then EMC will continue to play catch up as NetApp redefines Unified Storage and its role in shared infrastructure.