For a blog that gets very little love from me, I’m often surprised that it still gets any traffic. If you found some old content here that you like, and would like to read more stuff like it, then you’ll probably find it in one of two places.
The first will be on the new netapp communities site where I’ll put all my NetApp focussed (and by association, most of my storage technology related) blog posts, I’m currently waiting for a direct URL to be created, and when it is, I’ll update this post with the URL once I have it in a few days.
The second place is on crankbird.com That will focus on the non-netapp stuff that I find interesting and will be much more regular with usually much shorter posts which I hope you’ll find interesting. If you like the way I write, I’d really appreciate it if you’d come check it out and make some comments (unless of course your URL is cheapviagratablets/shoppingcart?-“drop table students -“)
For those of you who read my blogs and come back here looking for more of the same, thank you, I’ll try to give you something better in the future.
John “Ricky” Martin – aka @life_no_borders, aka @JohnMartinIT aka @crankbird
It’s no secret that Cisco has been investing a LOT of resources into cloud, and from those investments, we’ve recently seen them release a corresponding amount of fascinating new technology that I think will change the IT landscape in some really big ways. For that reason, I think this year’s Cisco Live in Melbourne is going to have a lot of relevance to a broad cross section of the IT community, from application developers, all the way down to storage guys like us.
For that reason, I’m really pleased to be presenting a Cisco partner case study along with Anuj Aggarwal who is the Technical account manager on the Cisco Alliance team for NetApp. Most of you wont have met Anuj, but if you’ve got the time, you should take the time to get to know him while he’s down here. Anuj knows more about network security, and the joint work NetApp and Cisco has been doing on hybrid cloud than anyone else I know, and this makes him a very interesting person to spend some time with, especially if you need to know more about the future of networking and storage in an increasingly cloudy IT environment.
I’m excited by this year’s presentation, because it expands on, and in many ways completes a lot of the work Cisco and NetApp have been doing since NetApp’s first appearance at Cisco Live in Melbourne in 2010 as a platinum sponsor, where we presented to a select few on SMT, or more formally Secure Multi-Tenancy.
A business mentor of mine once told me there are only four rational reasons why a company invests its capital, and those reasons are to improve revenue, decrease costs, reduce risk or improve agility. I asked if agility really deserved its own category, and he answered with a quote from Charles Darwin: -
“It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change”
He continued that improving revenue is actually almost arguable, because it’s the one thing over which the company has the least control, and that in a fast changing business environment, you’d be better off investing in agility so you can take advantage of uncertainty.
I was reminded of this recently because it’s been a little over ten years since Nick Carr wrote an article in the Harvard Business Review stating the IT doesn’t matter. I opened with this during NetApp’s recent Elevate conferences in Adelaide and Perth, and pointed out that IT that doesn’t improve top line revenue or a company’s agility is a recipe for a focus on nothing more than cost and risk reductions. I was surprised that my comment still provoked a pretty defensive result in some IT professionals.
As I talked about how IT infrastructure teams could learn a lot from agile software development methodologies, and that a datacenter built on software defined infrastructure would allow this, it struck me what was causing this defensive posturing. Risk management was THE key issue that had to be addressed before any of this could happen. To be sure, costs are important, but without a way of dealing with risk effectively, none of this agile, software defined, cloud nirvana was ever going to happen, or certainly not within the timeframes anyone outside of IT was going to tolerate.
This insight was particularly relevant to me because in IT, vendors talk a lot about private cloud to our customers. We talk about accelerating journeys, we talk about how it’s your cloud, we talk about the benefits and we publish case studies. At the same time our product organizations spend increasingly large amounts of their development time and resources on delivering technology to create service catalogs, analytics capabilities and automation and self-service frameworks.
Internally, and between ourselves in the breaks between presentations at events and conferences, many of us wonder why, despite the clear business benefits and available technology, the adoption rate is much slower than we would have expected, and many companies business units are leapfrogging their IT departments internal cloud developments to go directly to large public cloud offerings.
It wasn’t until I got home and I heard my wife say “That’s awesome, they’re teaching the concept of the undo-key” that I had my real epiphany. What she was talking about was a kickstarter project called Robot Turtles, a board game created by Dan Shapiro of Google that teaches primary school kids the basics of programming. While the concept is awesome, it struck me that the ability to easily undo a mistake so fundamental to Agile software development, that it is one of the first concepts you would teach. It was also the reason why infrastructure agility was something that was talked about far more than it was done. People can’t take the same risks with their data infrastructure that you can with software development, or a word processing document, and the reason is that for almost all of us, there is no genuinely effective equivalent of Control-Z for our infrastructure.
Imagine that, in order to roll back a mistake in a word processing document, that first you had to
- Open up a brand new document
- Copy all the text from the first document and past it into the second document, one paragraph at a time
- Run an macro that read the formatting on the first document
- Paste the results of that macro into the second document
Then if you made a mistake that you had to
- Delete your entire paragraph that had the mistake
- Copy the paragraph from the second document
- Find the portion of the script that had the formatting for the document you just copied back
- Run that portion of the script on the original document, and hope that it doesn’t affect any of the other paragraphs or muck up the indexing or cross referencing
Furthermore, imagine that your copy was usually twelve hours old, and you could only recover your data after you’d received permission via a formal change request that had to be approved by three managers who checked them into the change control systems, then arranged for them to be sent back, buried in soft peat for three years and then finally recycled as firelighters.
Clearly, nobody would use any software program that had those limitations, and yet that’s exactly the kind of thing infrastructure professionals have to deal with on a daily basis. It’s no wonder that their perception of risk management and that of the rest of the business are so different.
Agile methodologies deals with risk in a completely different way, it requires that you build your progress on small iterative steps, and that at the end of each step you gain some insight, which you then turn into action. Continuous testing, and continuous deployment significantly reduce the risks of major project failures previously associated with waterfall methodologies. Even with an entire data-center built on software defined infrastructure, without an easy way of testing new infrastructure builds, and fixing and correcting mistakes early, infrastructure operations will never be able to fully support the kinds of agility the business increasingly demands from IT. So long as internal IT lacks an effective undo-key, they will be stuck in the world of waterfall methodologies, and a cost effective, agile private cloud built on software defined principals will remain a future vision instead of a present day reality.
The nice thing from my perspective is that NetApp uniquely provides a well proven set of tools that provides the fine grained undo that works from a single document on a home drive, all the way up to a petabyte scale data-center. We provide a Control-Z that lets you innovate safely, and realize the benefits of private cloud on technology that is already in production in thousands of data centers.
Future blog posts will concentrate on specific technologies like Snapmirror, SnapCreator, and NetApp Shift and how they create and enable a Universal Data Platform that can be used to eliminate the risk that stands between where virtualization stands today, and a truly agile, hybrid cloud tomorrow.
if you take that 3 step process for creating a “Software Defined” infrastructure that I outlined in my previous post, you could reasonably say that storage has been “software defined” since about 1982 (arguably as early as 1958 when the first disk drive made its appearance)
- Step 1 – identify and then formally define a set of common functions or primitives performed by existing infrastructure that are optimally run in purpose built devices (e.g. hardware filled with interfaces and ASICs) – This becomes the "Data Plane".
- From a data storage perspective I have broken down what I see as the common storage primitives into four main categories. I’ll probably use these categories as a tool for functional comparisons of various Software Defined Storage implementations going into the future.
placement managment – e.g. given an logical address and some data by a requestor, write that data to an underlying storage medium so that it can be subsequently retrieved using that address without the requestor needing to be aware of the physical characteristics of that underlying storage medium
access managment – e.g. given an address by a requestor, read data from an underlying storage medium and make it available to the requestor. Additionally in the case where multiple requestors may make simultaneous requests to place or access the same data, provide a mechanism to arbitrate that access.
copy management – e.g. given a set of source addresses and a range of target addresses, copy the data from the source to the target on behalf of the requestor
persistence management – in most storage systems this is an implied function, though increasingly with the rise of protocols such as CDMI, and XAM, data persistence SLOs are being explicitly defined at placement time. In most cases however, data must be stored until the device itself fails, and the device is generally expected to have a lifetime of multiple years.
- Step 2 – Create a protocol that manages those functions
- The great thing about standards is there’s so many of them … and the storage industry just LOVES forming standards bodies to create new protocols to manage the functions I described above. Many of them have been around for a while: SCSI was standardised in 1982, NFS in 1989, SMB in 1992 (kind of), OSD in 2004 and in more recent times we have seen implementations like XAM in 2010, and most recently CDMI which became an ISO standard in 2012.
Some of us get religious about these standards and which one should be used for what purposes, what I find interesting is that they all seem to be converging around a common set of functionality, so it’s possible that we will eventually see one storage protocol to rule them all, but I doubt it will happen any time soon. In the near term, whether we need to create another new protocol is debatable, but as of this moment I’m pretty impressed with the work being done at SNIA with CDMI, not as a “new replacement” but as something which leverages the work that’s already been done with the other protocols and fills in their gaps, but I’m getting WAY ahead of myself here.
- Step 3 – Create a standards compliant controller that runs on general purpose hardware (e.g. an intel server, virtualised or otherwise) that takes higher order service requests from applications and translates those into the primitives codified in step 1, over the protocol devised in step 2. – This becomes the "Control Plane"
- Well if you accept that the existing storage protocol standards are functionally equivalent to the OpenFlow Protocol in Software defined Networking, then pretty much any modern operating system could function as a controller. Also any modern hypervisor also acts as a controllers, and any storage array which uses SCSI protocols to talk to the disks at the back end also acts as a controller, and in my view this is an accurate description.
- Each of these constructs acts a a standards compliant controller in a software defined storage infrastructure, with multiple levels of encapsulation with consequent challenges that there is significant functional control overlap between these controllers. Over the next few posts I’ll go through what this encapsulation looks like, where the challenges and opportunities are in each level, the design choices we face, and build that up so we can see how close we are to achieving something that matches some of they hype around software defined storage.
It’s also worth noting that until I’ve reached my conclusion, much of what I’ve written and will write will not neatly match up with the analyst definitions of software defined storage. If you bear with me we’ll get there, and probably then some. My hope is that if you follow this journey you’ll be in a better position to take advantage of something that I’ll be referring to as “SLO Defined” storage (simply because I really don’t think that “Software Defined” is particularly useful as a label)
If you want to jump there now and get the analysts views, check out what IDC and Gartner have to say. For example IDC’s definitions of software defined storage from http://www.idc.com/getdoc.jsp?containerId=prUS24068713 says in part
software-based storage stacks should offer a full suite of storage services and federation between the underlying persistent data placement resources to enable data mobility of its tenants between these resources
The Gartner definition which isn’t public, takes a slightly different approach and can be found in their document “2013 Planning Guide: Data Center, Infrastructure, Operations, Private Cloud and Desktop Transformation” where it talks about higher level functionality including the ability for upper level applications to define what storage objects they need with pre-defned SLO’s and then have that automatically provisioned to them. (or at least that is my take after a quick read of the document).
IMHO, both of these definitions have merit, and both go way beyond merely running array software in a VSA, or bundling software management functions into a hypervisor, or pretty much anything else that seems to pass for Software Defined Storage today, which is why I think it’s worth writing about …. In …. Painful …. Detail
As always, Comments and Criticisms are welcome.
Monday is my admin clean-up and research day, which makes it the best day for quadrant-II thinking, and most of what I’ve been thinking about recently is software defined storage, or if you’re an Openstack advocate then you’d call it software-led storage.
After spending more than a few weeks thinking and researching, I’ve come to the conclusion that I’m not a big fan of either term, especially as it pertains to storage. Given the likelihood of an increasingly fuzzy set of layers between hardware and software, I think that “software-led” is probably a more useful way of talking about the future of storage infrastructure, but even so I’m still not convinced it’s the most useful description either. Nonetheless, for the moment a lot of people are talking abut software-defined networks, datacenters and storage, so I’ll start to outline my breakdown of storage within that paradigm.
Software defined anything has its roots in software defined networking and OpenFlow, so the rest of this post goes through how I see Software Defined Networking, and then I’ll use that as a framework in future posts for talking about software defined storage.
So how do you define “Software Defined” I think if you’re going to use the term without it being just another way of saying virtualised, then you need to be talking about infrastructure built on the principal of a clean separation of hardware optmised functions from software control structures, or , in the parlance of Software Defined Networking separating the data plane from the control plane. That means to create something that is truly Software Defined XXXX and not just a marketing-sexy-me-too-rebrand you have to
- identify and then formally define a set of common functions or primitives performed by existing infrastructure that are optimally run in purpose built devices (e.g. hardware filled with interfaces and ASICs) – This becomes the “Data Plane”
- Create a protocol that manages those functions
- Create a standards compliant controller that runs on general purpose hardware (e.g. an intel server, virtualised or otherwise) that takes higher order service requests from applications and translates those into the primitives codified in step 1, over the protocol devised in step 2. – This becomes the “Control Plane”
The prime example of this is with software defined networking world that could be something that looks like this …
So why did this happen in networking, and not storage or compute ? Why now ? And why bother ?
As to why it happened in networking, there are a half a bazillion blogs out there on the subject, of which I’ve only read a small fraction, but from my perspective, I reckon it happened because of the following reasons
- by its very nature, customers have demanded that networking vendors must inter-operate with other vendors equipment in as seamless a fashion as possible
- there has been one absolutely dominant player in the market at pretty much all times along with a very well supported standards body.
- Networking subsequently evolved to the point where there is one dominant layer-2 implementation (Ethernet) with one dominant layer-3 implementation (IP), and a fairly small number of upper level protocols above that (TCP/UDP/HTTP etc).
- This has driven the similarity of network equipment functionality from disparate vendors that allowed the developers of openflow the opportunity to identify the commonality of flow-tables in hardware on which the elegant separation of control and data planes in SDN is built.
Like many “new and revolutionary ideas”, it probably worth noting, that this revolutionary “new” architecture has been evolving since at least 2001 when the IETF started the “Forwarding and Control Element Separation” (ForCES) working group”, and arguably before than back to 1996 with things like General Switch Management Protocol (GSMP).
But even if you can do this clean separation, why bother ? The development of openflow wasn’t driven by market requirements, it was developed to let researchers run interesting experiments on existing large scale university campus networks. While that’s a very cool thing to do as a researcher, running “experiments” on a large scale enterprise infrastructure isn’t something I’ve ever had much success with. About as adventurous as I get is asking for a vlan that spans two datacentres, and for the most part whenever I’ve suggested stuff like that in the past, I get one of those “Put the network diagram down … and STEP AWAY” looks from the network guys. I can only imagine what would happen if I said “Hey I’ve got this really cool idea for encapsulating fibre channel over token ring and running it on your existing Ethernet infrastructure”. Which begs the question, why on earth would anyone in Enterprise-IT implement want to implement something this radical ?
The answer for the most part is .. they don’t. Sure there is a promise that opening up the infrastructure will lead to more competition and that will reduce prices, but the last time I looked, the networking industry was already pretty competitive. Even of you were to pull a datacentre class switch apart into cheap basic hardware and smart software running on an Intel-box, the value that vendors like Cisco bring in terms of scalability, quality assurance, interoperability testing, support, professional services etc, will mean that in all likelihood, customers will be willing to pay a premium for their solutions, and Cisco and others like them may become even more profitable as a result. As a parallel case, there are plenty of free database offerings out there, and yet Oracle is doing just fine. You might expect that if “Software Defined” was something everyone now uses as a prime buying criteria you might see Larry Ellison extolling the virtues of a “Software Defined Database”. OK, maybe not given his rather sceptical comments about cloud in the early days, but they’re going in exactly the opposite direction, increasingly embedding more of their software into vertically integrated hardware solutions precisely because there is continued ongoing demand from enterprise customers for tightly integrated hardware/software solutions.
So if SDN isnt likely to significantly reduce costs, and there isnt an organic pent up demand within the enterprise, then where is the payoff for the large risks that come with developing and deploying any new technology ?
The answer to that question lies in the standardisation and maturity of today’s network protocols that led to the commonality expressed in flow-tables. The core protocols of TCP/IP were developed almost forty years ago and were built not only on a set of solid principals that have stood the test of time, but also on what were in 1973, some very reasonable assumptions. Unfortunately some of those assumptions no longer hold true e.g. there was an assumption that a machine with an IP address wont magically teleport from one physical location to another, yet this is exactly what happens when you try to migrate a virtual machine from one datacenter to the next. It is exactly those kinds of assumptions that are now causing problems the largest consumers of Network equipment: the large-scale cloud and telecom service providers.
That is why Software Defined Networking is suddenly interesting. For many businesses, IT infrastructure isn’t a competitive differentiator (it could be, and it should be, but right now it isn’t), but there are some very large customers, with some very large IT budgets for whom IT infrastructure is a core enabler of their business, and are willing to take on the risk of a new approach in the promise of disruptive innovation. These people aren’t just dreamers with fists full of VC dollars, but some of the networking industries largest and most influential customers. Other agile enterprise customers who understand how to leverage IT infrastructure for competitive advantage will also benefit from the investments of these larger organisations, but for the most enterprise customers, what passes today for software defined Networking will be restricted to virtual switches inside their virtual server infrastructure, and that, while useful, doesn’t exactly fit the definition I used at the beginning of this post.
Which brings me to storage .. For a number of vendors, a Virtual Storage Array = Software Defined storage, and while that’s reasonably valid, I also think it’s a bit of a half measure. I’m not saying that because I don’t like VSA’s, I do, I think they’re great, but, I don’t think that they’re the best example of what a software defined storage infrastructure can do. They might be part of it, but they’re not a necessary part, and in some cases, I’d argue that they’re not even a desirable part of a cleanly separated software defined infrastructure. And that is what I’m going to cover in my next post.
I was reading a blog post by Duncan Epping here http://www.yellow-bricks.com/2013/04/24/re-is-vsa-the-future-of-software-defined-storage-openi/ around VSA’s and software defined storage, and put in one of my usually overly long replies when I thought it might make a reasonable blog post here, because it outlines a number of my key thoughts on this which I was planning on writing about later on. If you get the chance, read Duncan’s post as theres some good stuff in the main blog as well as some interesting comments.
The following was my reply with some typo cleanup …
IMHO VSA’s will be an important part of the software defined storage (SDS) landscape but by no means are they the complete story. What is lacking in SDS is the equivalent of flow-tables in switches. If you go with the whole “separate the control plane from the data plane” definition of software defined anything, then you could reasonably argue that this is exactly what things like the VERITAS volume manager and file system did way back in the 80’s. For a whole stack of good reasons people chose to bifurcate that responsibility of managing that functionality increasingly into the storage and application layers, leaving those product with increasingly niche roles. The advent of SDS might change swing that pendulum back towards 80’s style architecture for a while, but people tend towards vertically integrated solutions when the complexities of managing and integrating solutions themselves becomes economically unviable, and designing a reliable storage solution with high performance at large scale that caters for a large variety or workload types is very very hard to do well.
Going back to the lack of a storage equivalent of flow-tables, the trouble with SDS is that storage requirements are much less homogenous than switching requirements and much harder to bring down to a small number of discrete functions that can be acclerated in hardware. I think that over time these will become more obvious, the first and most obvious of which is copy offload/management, but these requirements will probably evolve over time.
Rather than focus on building an industry/standards defined theoretical model, and trying to wedge/judge all the designs by that model, I think we’d be better served by loosening up the vertical integration of storage systems and then finding a variety of creative ways of leveraging large amounts of cheap CPU/Memory/Cache/Disk sitting in the virtualisation layer. VSA’s are a fairly coarse grained way of achieving this, but many of them don’t elegantly leverage tightly/vertically integrated infrastructure to accelerate or drive efficiencies where that is appropriate.
For example, there are ways of using the hypervisor resources as a “data plane” and leaving the control plane in the centralised array, such as NetApp FlashAccel . This is kind of counter-intuitive to the existing “control-plane lives in the hypervisor” model as the cache is seen as an extension of the hardware array rather than the array being seen as an extension of the hypervisor. To be fair the model isn’t that pure, as control portions are distributed between the array and they hypervisor. My point is that the boundaries become a lot fuzzier, and will be functionality will be divided and combined in a variety of interesting ways, and so long as storage is asked to perform so many different tasks, I think that’s a good thing.
While I love VSAs as a conceptually neat little package of functionality with tightly defined boundaries, (The DataONTAP VSA’s in particular, especially if you’re aware of their roadmap) I think that data and storage management will for the foreseeable future be a shared responsibility between applications, hypervisors, operating systems and arrays. The biggest challenge we face is co-ordinating these responsibilities and choosing the most efficient and automatable ways of combining them to give customers what they need without needlessly locking them into inflexible architecture choices.
There have been some really interesting discussions within the NetApp technical community around the benefits of NFS vs Fibre Channel for Oracle and VMware workloads, much of which I’m planning on plagiarizing and passing off as my own work on this blog, but in the interests of trying to keep the small shreds of integrity I have left intact, I thought I’d post up my contribution to the conversation first, because a friend of mine suggested the issues raised in the rant would make a good blog post
This response came after a series of posts on whether you needed to put in a dedicated ethernet network to carry the storage traffic, which would tend to diminish much of the “Ethernet Infrastructure is Cheaper” argument, or whether you can simply carve out a VLAN instead, what we considered to be best practice, and whether that matches what happens in a FlexPod.
My Response (slightly edited where you see the [ ...] but still including the rant meta-tags that I feel should be made an official part of HTML 6) follows
[You can just use] a VLAN, but you’d want to make sure the underlying Ethernet network was designed with storage style SLO’s eg. Not oversubscribed, no single points of failure, non-disruptively upgradable, port channels set up correctly, Jumbo frames set up correctly etc.
Using Ethernet networks designed for typical “user LAN” SLO’s and expectations for a storage network will probably screw up something at some time and is IMHO the reason why iSCSI and NFS still gets a bad reputation.
There are a number of people out there in customer network teams [....] that think they know how to design a reliable high performance Ethernet network with storage level SLO’s and many of those people overestimate their capabilities. For that reason storage teams have rightfully been wary of the network team and use FC infrastructure as an excuse to keep them from fiddling about in stuff they don’t really understand. Ethernet based storage networks make that much harder to do, but running up a brand new network design/implementation specifically for storage is still a reasonably good idea unless there is a good reason to believe the existing network has been designed with storage class SLOs in mind.
That is partly the reason why in a FlexPod, the only thing you don’t flex is the network configuration. In a FlexPod you can safely carve out a VLAN, apply the appropriate COS to it (as specified in the design guide) and be confident that you’re going to get a good result, there’s almost 40Gbit of bandwidth available in the current design, so it’s pretty unlikely you’re going to run out of that resource in a hurry, and if you do … buy another FlexPod.
Personally I’m a strong believer in multi-purpose Ethernet based networks based on 10Gbit/40Gbit with storage class SLO’s. The economic benefits go way beyond just reducing the costs of FC infrastructure and goes to the principal of large pools of [stateless] physical resources [from which virtual constructs are] provisioned automagically on demand, and this ties in really nicely with [NetApp's] overall value proposition. Cisco, not surprisingly have a pretty good pitch around this, and [it would be worth working with them on this]. Furthermore, it also allows you to get involved in the ever increasingly sexy subject of software defined networking.
Transformation of the network infrastructure gets LOTS of attention at the C-Level and providing solid justifications / use cases for that can help [the customer accelerate their justifications for a datacenter transformation projects].
Because of that I would never be shy about pushing NFS (or going forward SMB3) on Ethernet as a superior solution to FC, especially with things like RoCEE on the horizon as it might just be the thing that disrupt [legacy methods] as it has done in some of NetApp’s largest accounts many times over.
For me supporting FC (or even iSCSI) is more about helping a customer to protect the investment they’ve already made in legacy infrastructure while they migrate to something more efficient and flexible.
Finally this is less about NFS vs FC than it is about FlexVols vs LUNS, and LUNs are really DUMB storage containers … don’t encourage people to keep using them when they have a better alternative
I’d be interested on any perspectives or particular interests, because I’m digging through some of our internal and customer facing documents to summarise exactly what the benefits are, where the limits are, and what you need to do to implement an ethernet network with storage class SLOs at small and large scales.