A business mentor of mine once told me there are only four rational reasons why a company invests its capital: to improve revenue, decrease costs, reduce risk or improve agility. I asked if agility really deserved its own category, and he answered with a quote from Charles Darwin:
“It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change”
He continued that improving revenue is actually the most arguable of the four, because it’s the one thing over which the company has the least control, and that in a fast-changing business environment, you’d be better off investing in agility so you can take advantage of uncertainty.
I was reminded of this recently because it’s been a little over ten years since Nick Carr wrote an article in the Harvard Business Review stating that IT doesn’t matter. I opened with this during NetApp’s recent Elevate conferences in Adelaide and Perth, and pointed out that IT that doesn’t improve top line revenue or a company’s agility is a recipe for a focus on nothing more than cost and risk reductions. I was surprised that my comment still provoked a pretty defensive response from some IT professionals.
As I talked about how IT infrastructure teams could learn a lot from agile software development methodologies, and that a datacenter built on software defined infrastructure would allow this, it struck me what was causing this defensive posturing. Risk management was THE key issue that had to be addressed before any of this could happen. To be sure, costs are important, but without a way of dealing with risk effectively, none of this agile, software defined, cloud nirvana was ever going to happen, or certainly not within the timeframes anyone outside of IT was going to tolerate.
This insight was particularly relevant to me because in IT, vendors talk a lot about private cloud to our customers. We talk about accelerating journeys, we talk about how it’s your cloud, we talk about the benefits and we publish case studies. At the same time our product organizations spend increasingly large amounts of their development time and resources on delivering technology to create service catalogs, analytics capabilities and automation and self-service frameworks.
Internally, and between ourselves in the breaks between presentations at events and conferences, many of us wonder why, despite the clear business benefits and available technology, the adoption rate is much slower than we would have expected, and why many companies’ business units are leapfrogging their IT departments’ internal cloud developments to go directly to large public cloud offerings.
It wasn’t until I got home and I heard my wife say “That’s awesome, they’re teaching the concept of the undo-key” that I had my real epiphany. What she was talking about was a Kickstarter project called Robot Turtles, a board game created by Dan Shapiro of Google that teaches primary school kids the basics of programming. While the concept is awesome, it struck me that the ability to easily undo a mistake is so fundamental to Agile software development that it is one of the first concepts you would teach. It was also the reason why infrastructure agility was something that was talked about far more than it was done. People can’t take the same risks with their data infrastructure that they can with software development, or a word processing document, and the reason is that for almost all of us, there is no genuinely effective equivalent of Control-Z for our infrastructure.
Imagine that, in order to roll back a mistake in a word processing document, you first had to
- Open up a brand new document
- Copy all the text from the first document and paste it into the second document, one paragraph at a time
- Run a macro that reads the formatting of the first document
- Paste the results of that macro into the second document
Then, if you made a mistake, you had to
- Delete your entire paragraph that had the mistake
- Copy the paragraph from the second document
- Find the portion of the script that had the formatting for the document you just copied back
- Run that portion of the script on the original document, and hope that it doesn’t affect any of the other paragraphs or muck up the indexing or cross referencing
Furthermore, imagine that your copy was usually twelve hours old, and you could only recover your data after you’d received permission via a formal change request that had to be approved by three managers who checked them into the change control systems, then arranged for them to be sent back, buried in soft peat for three years and then finally recycled as firelighters.
Clearly, nobody would use any software program that had those limitations, and yet that’s exactly the kind of thing infrastructure professionals have to deal with on a daily basis. It’s no wonder that their perception of risk management and that of the rest of the business are so different.
Agile methodologies deal with risk in a completely different way: they require that you build your progress on small iterative steps, and that at the end of each step you gain some insight, which you then turn into action. Continuous testing and continuous deployment significantly reduce the risks of major project failures previously associated with waterfall methodologies. Even with an entire data-center built on software defined infrastructure, without an easy way of testing new infrastructure builds, and fixing and correcting mistakes early, infrastructure operations will never be able to fully support the kinds of agility the business increasingly demands from IT. So long as internal IT lacks an effective undo-key, it will be stuck in the world of waterfall methodologies, and a cost effective, agile private cloud built on software defined principles will remain a future vision instead of a present day reality.
The nice thing from my perspective is that NetApp uniquely provides a well proven set of tools that provides the fine grained undo that works from a single document on a home drive, all the way up to a petabyte scale data-center. We provide a Control-Z that lets you innovate safely, and realize the benefits of private cloud on technology that is already in production in thousands of data centers.
Future blog posts will concentrate on specific technologies like SnapMirror, SnapCreator, and NetApp Shift and how they create and enable a Universal Data Platform that can be used to eliminate the risk that stands between where virtualization stands today, and a truly agile, hybrid cloud tomorrow.
During a discussion I had at the SNIA blogfest, I mentioned that I’d written a whitepaper around archiving and I promised that I’d send it on. It took me a while to get around to this, but I finally dug it out from my archive, which is implemented in a similar way to the example policy at the bottom of this post. It only took me about a minute to search for and retrieve it once I’d started looking, from a FAS array that was far enough away from me to incur a 60ms RTT latency. Overall I was really happy with the result.
The document was in my archive because I wrote it almost two years ago. Since then a number of things have changed, but the fundamental principles have not. I’ll work on updating this when things are less busy, probably sometime around January ’11. On a final note, because I wrote this a couple of years ago when my job role was different than it is today, this document is considerably more “salesy” than my usual blog posts. It shouldn’t be construed as a change in direction for this blog in general.
There are a number of approaches that can be broadly classified as some form of Archiving, including Hierarchical Storage Management (HSM), and Information Lifecycle Management (ILM). All of these approaches aim to improve the IT environment by
- Lowering Overall Storage Costs
- Reducing the backup load and improving restore times
- Improving Application Performance
- Making it easier to find and classify data
The following kinds of claims are common in the marketing material promoted by vendors of Archiving software and hardware
“By using Single Instance Storage, data compression and an ATA-based archival storage system (as opposed to a high performance, Fibre Channel device), the customer was able to reduce storage costs by $160,000 per terabyte of messages during the first three years that the joint EMC / Symantec solution was deployed. These cost savings were just the beginning, as the customers were also able to maintain their current management headcount despite a 20% data growth and the time it took to restore messages was drastically reduced. By archiving the messages and files, the customer was also able to improve electronic discovery retrieval times as all content is searchable by keywords.”
These kinds of results, while impressive, assume a number of things that are not true in NetApp implementations
- A price difference between primary storage and archive storage of over $US160,000 per TB.
- Backup and restores are performed from tape using bulk data movement methods
- Modest increases in storage capacities require additional headcount
In many NetApp environments, the price difference between the most expensive tier of storage and the least expensive simply does not justify the expense and complexity of implementing an archiving system based on a cost per TB alone.
For file serving environments, many file shares can be stored effectively on what would be traditionally thought of as “Tier-3” storage with high-density SATA drives, RAID-6 and compression / deduplication. This is because unique NetApp technologies such as WAFL and RAID-DP provide the performance and reliability required for many file serving environments. In addition, the use of NetApp SnapVault replication-based data protection for backup and long term retention means that full backups are no longer necessary. The presence or absence of the kinds of static data typically moved into archives has little or no impact on the time it takes to perform backups, or make data available in the case of disaster.
Finally, the price per GB and IOPS for NetApp storage has fallen consistently in line with the trend in the industry as a whole. Customers can lower their storage costs by purchasing and implementing storage only as required. The ability of NetApp FAS arrays to non-disruptively add new storage, or move excess storage capacity and I/O from one volume to another within an aggregate, makes this approach both easy and practical.
While the benefits of archiving for NetApp based file serving environments may be marginal, archiving still has significant advantages for email environments, particularly Microsoft Exchange. The reasons for this are as follows
- Email is cache “unfriendly” and generally needs many dedicated disk spindles for adequate performance.
- Email messages are not modified after they have been sent/received
- There is a considerable amount of “noise” in email traffic (spam, jokes, social banter etc)
- Small Email Stores are easier to cache, which can significantly improve performance and reduce the hardware requirements for both the email servers and the underlying storage
- Email is more likely to be requested during legal discovery
- Enterprises now consider Email to be a mission critical application and some companies still mandate a tape backup of their email environments for compliance purposes.
Choosing the right Archive Storage
It’s about the application
EMC and NetApp take very different approaches to archive storage, each of which works well in a large number of environments. An excellent discussion on the details of this can be found in the NetApp whitepaper WP-7055-1008 Architectural Considerations for Archive and Compliance Solutions. For most people however, the entire process of archive is driven not at the storage layer, but by the archive applications. These applications do an excellent job of making the underlying functionality of the storage system transparent to the end user, however the user is still exposed to the performance and reliability of the storage underlying the archives.
Speed makes a difference
Centera was designed to be “Faster than Optical” and while it has surpassed this relatively low bar, its performance doesn’t come close to even the slowest NetApp array. This is important, because the amount of data that can be pushed onto the archive layer is determined not just by IT policy, but also by user acceptance and satisfaction with the overall solution. The greater the user acceptance, the more aggressive the archiving can be, which results in lower TCO and faster ROI.
Protecting the Archive
While the archive storage layer needs to be reliable, it should be noted that without the archive application and its associated indexes, the data is completely inaccessible, and may as well be lost. While it might be possible to rebuild the indexes and application data from the information in the archive alone, this process is often unacceptably long. Protecting the archive involves protecting the archive data store, the full text indexes, and the associated databases in a consistent manner at a single point in time.
Migrating from an Existing Solution
Many companies already have archiving solutions in place, but would like to change their underlying storage system to something faster and more reliable. Fortunately, archiving applications build the capability to migrate data from one kind of back-end storage to another into their software. The following diagrams show how this can be achieved for EmailXtender and DiskXtender to move data from Centera to NetApp.
Some organizations would prefer to completely replace their existing archiving solutions including hardware and software. For these customers NetApp collaborates with organizations such as Procedo (www.procedo.com), to make this process fast and painless.
As mentioned previously, the cost and complexity of traditional archiving infrastructure may not add sufficient value to a NetApp file-serving environment, as many of the problems it solves are already addressed by core NetApp features. This does not mean that some form of storage tiering could not or should not be implemented on FAS to reduce the amount of NetApp primary capacity.
One easy way of doing this is by taking advantage of the flexibility of the built in backup technology. This is an extension of the “archiving” policy used by many customers, where the backup system is used for archive as well. While the approach of mixing backup and archive is rightly discouraged by most storage management professionals, the reasons for discouraging it in traditional tape based backup environments don’t apply here.
The reasons for this are
- Snapshot and replication based backups are not affected by capacity as only changed blocks are ever moved or stored
- The backups are immediately available, and can be used for multiple purposes
- Backups are stored on high reliability disk in a space efficient manner using both non-duplication and de-duplication techniques
- Files can be easily found via existing user interfaces such as Windows Explorer or external search engines
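The first point in that list is easier to see with a toy model. The sketch below is not how SnapVault works internally; it’s a minimal Python illustration, with a made-up 4-byte block size and hypothetical function names, of why a backup that only tracks changed blocks moves an amount of data proportional to the change rather than to the size of the volume.

```python
BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to follow


def split_blocks(data):
    """Split a volume image (bytes) into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def changed_blocks(previous, current):
    """Return {block_index: block} for every block that differs from the last recovery point."""
    old = split_blocks(previous)
    delta = {}
    for index, block in enumerate(split_blocks(current)):
        if index >= len(old) or block != old[index]:
            delta[index] = block
    return delta


def apply_delta(previous, delta, new_length):
    """Rebuild the current image from the previous recovery point plus the changed blocks."""
    blocks = split_blocks(previous)
    for index, block in sorted(delta.items()):
        if index < len(blocks):
            blocks[index] = block
        else:
            blocks.append(block)
    return b"".join(blocks)[:new_length]
```

With `day1 = b"AAAABBBBCCCCDDDD"` and `day2 = b"AAAAXXXXCCCCDDDD"`, `changed_blocks(day1, day2)` contains a single entry for block 1, so only 4 of the 16 bytes would need to be transferred, and `apply_delta(day1, delta, len(day2))` reproduces `day2`.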
In general, SnapVault destinations use the highest density SATA drives with the most aggressive space savings policies applied to them. These policies and techniques, which may not be suitable for high performance file sharing environments, provide the lowest cost per TB of any NetApp offering. This combined with the ability to place the SnapVault destination in a remote datacenter may relieve the power, space and cooling requirements of increasingly crowded datacenters.
An example policy
Many companies’ file archiving requirements are straightforward, and do not justify the detailed capabilities provided by archiving applications. For example, a company might implement the following backup and archive policy
- All files are backed up on a daily basis, with daily recovery points kept for 14 days, weekly recovery points kept for two months and monthly recovery points kept for seven years.
- Any file that has not been accessed in the last thirty days will be removed from primary storage and will need to be accessed from the archive
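As a rough illustration of the retention rule in the first bullet, here is a small Python sketch that decides whether a given recovery point is still kept. It is only a way of reading the policy, not something you would run against a storage system: the function name is made up, and I’ve assumed that the weekly point is the one taken on a Sunday, that the monthly point is the one taken on the first of the month, and that “two months” means 62 days.

```python
from datetime import date

DAILY_DAYS = 14          # daily recovery points are kept for 14 days
WEEKLY_DAYS = 62         # "two months", approximated as 62 days (assumption)
MONTHLY_DAYS = 7 * 365   # seven years, ignoring leap days (assumption)


def keep_recovery_point(taken, today):
    """Return True if a recovery point taken on `taken` is still retained on `today`."""
    age = (today - taken).days
    if age <= DAILY_DAYS:
        return True                                   # every daily point is kept
    if age <= WEEKLY_DAYS and taken.weekday() == 6:   # 6 == Sunday (assumed weekly point)
        return True
    if age <= MONTHLY_DAYS and taken.day == 1:        # first of the month (assumed monthly point)
        return True
    return False
```

For example, on 30 November 2010 a recovery point from Sunday 7 November would still be kept as a weekly, one from Tuesday 9 November would have expired, and one from 1 June would be kept as a monthly.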
This is easily addressed in a SnapVault environment through the use of the following
- Daily backups are transferred from the primary system to the SnapVault repository
- Daily recovery points (snapshots) are kept on both the primary storage system and the SnapVault repository for 14 days
- Weekly recovery points (snapshots) are kept only on the SnapVault repository
- Monthly recovery points (snapshots) are kept only on the SnapVault repository
- A simple shell script/batch file is executed after each successful daily backup which deletes any file from the primary volume that has not been accessed in thirty days
- Users are allocated a drive mapping to their replicated directories on the SnapVault destination.
- Optionally the Primary systems and SnapVault repository may be indexed by an application such as the Kazeon IS1200, or Google enterprise search.
Users then need to be informed that old files will be deleted after thirty days, and that they can access backups of their data, including the files that have been deleted from primary storage, by looking through the drive that is mapped to the SnapVault repository, or optionally via the enterprise search engine’s user access tools.
By removing the files from primary storage, instead of the traditional “stub” approach favoured by many archive vendors, the overall performance of the system will be improved by reducing the metadata load, and users will be able to more easily find active files by having fewer files and directories on the primary systems.
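The “simple shell script/batch file” step in the list above could look something like the following. I’ve written it in Python rather than shell to keep it portable, and it is a sketch under a few assumptions: the function names are hypothetical, the file system must actually be recording last-access times (some are mounted with access-time updates disabled), and it should only ever be run after the daily backup has been confirmed successful. It defaults to a dry run that only reports what it would delete.

```python
import os
import time

THIRTY_DAYS = 30 * 24 * 60 * 60  # the thirty-day threshold from the policy, in seconds


def stale_files(root, max_idle_seconds=THIRTY_DAYS, now=None):
    """Yield paths under root whose last-access time is older than the threshold."""
    now = time.time() if now is None else now
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            # lstat so that symbolic links are judged on their own access time
            if now - os.lstat(path).st_atime > max_idle_seconds:
                yield path


def archive_sweep(root, dry_run=True, now=None):
    """Remove stale files from the primary volume; with dry_run, only list them."""
    swept = []
    for path in stale_files(root, now=now):
        if not dry_run:
            os.remove(path)
        swept.append(path)
    return swept
```

Running `archive_sweep("/path/to/share")` first, and reviewing the list it returns, is a sensible check before calling it again with `dry_run=False`. The path here is a placeholder.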
Many organisations’ archiving requirements can be met by simply adding additional SATA disk to the current production system, replicated via SnapMirror to the current DR system, rather than managing separate archive platforms.
This architecture provides flexibility and scalability over time and reduces management overhead. Tape can also be used for additional backup and longer term storage if required. SnapLock provides the non-modifiable WORM like capability required of an archive without additional hardware (a software licensable feature, see more detail at http://www.netapp.com/us/products/protection-software/snaplock.html ).
The last few months have been interesting for me, as my new job role involves a lot of work with alliance partners, many of whom either didn’t know anything about NetApp, or where they did know something it was along the lines of “Oh yeah, the NAS company”. In many respects, it’s a lot easier to explain what we do when someone has an open mind, as pre-conceived notions are often hard to budge, and telling someone they’re wrong is rarely a good way to start a trusted relationship. Even though I report up through our local director of marketing, my soul is still that of an engineer, so when it comes to describing what NetApp does, and why that’s important, I tend to go straight to “Well, we still sell NAS, and that’s a big part of what we do, but what we really sell is Unified Storage”, at which point I expect to see the “and I should care about this because …” look.
I’ve been seeing this look quite a bit recently, mostly because many of the people I speak to also get briefs from other storage vendors, and they too have suddenly started talking about “Unified Storage” without really understanding it or explaining its relevance to datacenter transformation. A good case in point was the opportunity I had to speak at the local VMware seminar series where I shared a stage with VMware, Cisco, and EMC. All of us got our 7.5 minutes to explain how we helped accelerate our customers’ journey to the cloud. VMware went first, followed by EMC, then Cisco and then me.
I’d prepared two slides for my 7 minutes focussing on our key differentiators, Unified storage, tight VMware integration with advanced storage features, deduplication and storage efficiency, Secure Multi-Tenancy, Cisco validated designs, Backup and recovery, and waited happily to see if EMC would come out with their usual pitch.
Boy, was I surprised … EMC’s pitch was Unified storage, deduplication, tight VMware integration with advanced storage features, deduplication and storage efficiency, security and Cisco validated designs … what the ????, had I suddenly slipped into a parallel universe ? Had EMC, a company fairly well known for pushing seven different kinds of storage with forklift upgrades suddenly capitulated and acknowledged that the approach NetApp had been pushing for so many years was actually right ? Was Chuck Hollis about to come on stage and apologise for blatant manipulation of social media and comment filtering ?
Now while I could have picked holes in their story by pointing out that at least from a VMware perspective they don’t have deduplication, that their advanced integration with VAAI hadn’t been released, there was no Cisco Validated Design for vBlock, and that the RSA stuff had no integration at the storage layer, nobody is really interested in hearing vendors denigrate each other, and I only had 7 minutes to figure out how to show our unique ability to help customers in the face of the most shameless “me too” campaign I’ve ever seen. During that 7 minutes there was one thing that really struck me. EMC has no real concept of why unified storage is important. Their concept of unified storage was something that allowed connection by Fibre Channel, iSCSI, CIFS and NFS and had a nice GUI. Having worked at NetApp for a number of years, I was surprised at how they’d missed the point completely. Almost everyone at NetApp knows that these are good features to have (we’ve had them for over 10 years now), but we also know that by themselves, they have only limited benefits. I’ve had a little while now to think about this, and it’s become clear to me that for other vendors, Unified storage is not a strategic direction, but a tactical response to NetApp’s continued success in gaining market share. This becomes even more obvious by taking a look at their storage portfolios
|Entry Level NAS / NAS Gateway||FAS||Iomega||Windows Storage Server||Windows Storage Server||N-Series (OEM)||Windows Storage Server|
|Entry Level SAN / iSCSI||FAS||Celerra NX||Equallogic||MSA|
|MidRange NAS||FAS||Celerra NS||Celerra (OEM)||Polyserve ?||N-Series (OEM)||BlueArc (OEM)|
|Archive & Compliance||FAS||Centera||Centera (OEM)||HP RISS||FAS||HCP|
|Backup to disk platform||FAS||DataDomain||HP Sepaton (OEM)||Diligent VTL||Diligent (OEM)|
|Storage Virtualisation Gateway||FAS (V-Series)||Invista||N-Series Gateway (OEM)|
|Object Repository||StorageGRID||Atmos||StorageGRID (OEM)||StorageGRID (OEM)||HCP|
|High End / Scale Out||FAS / FAS (C-Mode)||V-Max||V-Max (OEM)||USP-V (OEM)||DS6xxx/8xxx|
|Mainframe||N/A||V-Max||V-Max (OEM)||USP-V (OEM)||DS8xxx|
Now, if you match one of those arrays against the workload it was designed for, you’ll probably get a pretty good result. In a static, reasonably predictable environment without much change, you could make a reasonable argument that this was the best approach to take. You built a silo around an application or function, and purchased the equipment that matched that function. I’ve seen more than one customer that had every product in a vendor’s portfolio, and seemed to be fairly happy, or at least had been until fairly recently.
The problem with these narrow silo’ed approaches is that each silo creates new inefficiencies and dedicated areas of whitespace in both capacity and performance. For example, there is no way of taking excess capacity allocated to a backup to disk appliance and using it for CIFS home directories, nor is there a way of taking the excess IOPS capability of a temporarily idle disk archive and allocating those IOPS to another application undergoing an unusual workload spike such as a VDI bootstorm.
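A bit of arithmetic makes the whitespace problem concrete. The numbers below are invented purely for illustration and don’t describe any real array; the Python sketch simply compares how much demand goes unserved when each silo can only use its own IOPS with what happens when the same total capability is pooled.

```python
# Hypothetical figures: (silo name, IOPS the silo can deliver, IOPS demanded right now)
silos = [
    ("backup-to-disk", 20000, 2000),
    ("archive", 10000, 1000),
    ("vdi", 30000, 45000),  # boot storm: demand exceeds what this silo can deliver
]


def unmet_demand_siloed(silos):
    """Each silo can only serve its own workload, so idle IOPS elsewhere are no help."""
    return sum(max(0, demand - available) for _name, available, demand in silos)


def unmet_demand_pooled(silos):
    """All IOPS are treated as one shared pool that any workload can draw on."""
    total_available = sum(available for _name, available, _demand in silos)
    total_demand = sum(demand for _name, _available, demand in silos)
    return max(0, total_demand - total_available)
```

With these made-up figures the siloed layout leaves 15,000 IOPS of VDI demand unserved while 27,000 IOPS sit idle in the other two silos; pooled, the same hardware covers the spike with room to spare.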
But for me, the biggest area of waste is that of management. Each of these silos tends to get its own set of administrators and workflows, each of which may, or may not, work in harmony with the others. Most of us have experienced the bitterness and waste of IT turf wars, and the traditional vendors not only encourage, but depend on and help maintain these functionality silos, as it allows a divide and conquer sales model that benefits the vendor far more than the customer. If there was a book entitled “How to build an inflexible and wasteful IT infrastructure”, I imagine that encouraging and spreading “IT functionality silos” would fill up the first few chapters. Even though there are a bunch of people who have been quite happy with this status quo, and with the business processes from budgeting and product selection all the way through to procurement and training that entrench this model, things are changing, and they’re changing a lot faster than I thought they would.
A lot of credit for this change has to go to vendors like Microsoft, Cisco and VMware whose products have blurred the lines of these traditional silos. Virtualisation at both the compute and network layers has driven the kinds of cross functional change CIOs have been crying out for, and in the wake of this, Unified storage finds its natural fit; not because of its support for multiple protocols, but because these environments require the kind of workload agility and management simplicity that only a truly Unified storage offering can fully satisfy.
But it’s not just server and desktop virtualisation and other forms of shared infrastructure where unified storage is a natural fit. Almost any “multi-part” or landscape style application can benefit too, not just because of the flexibility and efficiency, but more importantly because of the fact that these environments are really hard to protect effectively. A really good example of this is Enterprise Content Management Applications such as FileNet and SharePoint
Typically these applications have
- Content Servers
- Business Process Workflow Engines / Servers
- Database servers
- Index Servers
In a large installation, there will be many of these servers and multiple databases, indexes and content repositories to cater for scalability and in some cases, the tyranny of distance (latency is forever).
In a “traditional” silo’ed model this data would be stored on two or possibly three different kinds of storage arrays, each with its own method of backup and replication, most of which depend on some form of “bulk copy” backup method as the primary form of logical data protection. The effect of this is that backing up these ECM systems on traditional storage architectures is almost impossible. While I’ve been talking to customers about this for a few years, recently there seems to have been a big increase in customers seeing these problems. In one case a design review for a Petabyte scale SharePoint implementation identified that if a critical index was lost the entire infrastructure would be effectively unrecoverable, and that there was no effective backup capability. In another discussion I had today around redesigning data protection, a brief mention of ECM created more interest than almost anything else, simply because of the difficulty of backing up their Documentum system.
Truly unified storage not only allows data to be stored using multiple protocols, and provides rich functionality like deduplication and compliant WORM storage which makes it a logical choice for ECM solutions, but more importantly it also provides a single integrated method for protecting that data in a way that is application-consistent without the need for a “cold” backup. And, you guessed it, NetApp can do that with ease, whereas other vendors’ versions of what they are calling unified storage would find that challenging (to be kind).
In the next few posts, I’ll take a deeper dive into exactly what NetApp does for Enterprise Content Management, with a focus on why Unified Storage is such a good match, and what we do to protect a company’s most important data assets.
Causes of Data Loss
While outages caused by “acts of God”, terrorist attacks, and utility failures garner significant press coverage, the more mundane “day to day” causes of data loss go unreported, and generally un-noticed. Furthermore, a quick Google search on the term “causes of data loss” turns up far more results on what could be more accurately described as “illegal data access”, driven by legislation in the US that mandates public reporting of this class of failure in information security. As NetApp founder Dave Hitz states in his blog, this kind of reporting “baffles our risk intuition”, and results in significant amounts of resources being dedicated to solving problems that may never occur. As a case in point, a 2006 survey of over 260 IT professionals found that the leading causes of unexpected downtime for databases over the past year were infrastructure issues, followed by software and database glitches. While no root cause analysis was reported for the infrastructure outages, one of the interesting findings from this report was that while most business continuity plans brace for external events beyond the direct control of organizations, few, if any, companies in the survey said such events contributed to database downtime.
So if your real problems aren’t likely to be random meteorites, shadowy Eastern Bloc crime syndicates, or crazed freedom fighters, then what are the things that cause data to become unavailable?
One guide often cited is the following figures from Kroll Ontrack on the causes of data loss
|Cause of data loss||Perception||Reality|
|Hardware or system problem||78%||56%|
|Software corruption or problem||7%||9%|
While the raw data and research methodology behind these figures are unavailable, it seems reasonable to assume that the data came from their own customer base. Kroll Ontrack’s data recovery services are heavily oriented towards workstation and laptop users, whose data rarely benefits from RAID protected disk subsystems. As a result, it is reasonable to assume the failure percentages caused by hardware and system problems would be far smaller in enterprise class datacentres, though the added complexity may result in a corresponding increase in the percentages for user error, or as Steve Chambers put it so succinctly in his blog, a decrease in the mean time before cockup (MTBC).
A more thorough search of available literature and IT news coverage supports this view with anecdotal evidence pointing to human errors being the largest cause of data loss in enterprise environments.
Without a way of rapidly creating multiple recovery points throughout the day, the single largest cause of data loss is not really addressed, especially for those people who demand really small recovery point objectives. While synchronous mirroring might satisfy zero RPO for a very small class of data loss events, recovery from the majority of failures depends on the old school “bulk copy” methods implemented by 99% of data protection specialists, methods that typically have a recovery point objective of between 12 and 24 hours, and a recovery time objective that can be equally scary.
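The relationship between the number of recovery points and the achievable recovery point objective is simple arithmetic, sketched below in Python. The function name is mine, and it assumes recovery points are evenly spaced through the day and that a failure can strike just before the next one is taken.

```python
def worst_case_data_loss_hours(recovery_points_per_day):
    """With evenly spaced recovery points, up to one full interval of changes can be lost."""
    if recovery_points_per_day < 1:
        raise ValueError("at least one recovery point per day is required")
    return 24 / recovery_points_per_day
```

A single nightly bulk copy gives `worst_case_data_loss_hours(1)`, which is 24 hours of exposure, while hourly recovery points (`24`) cut that to one hour and one every fifteen minutes (`96`) to a quarter of an hour.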
Planned vs. Unplanned Downtime
Traditional data protection strategies focus on reactive response to unplanned events and the associated unplanned downtime. Yet storage and consequent server and application unavailability result from both unplanned and planned downtime. According to some estimates, over 80% of all downtime is planned. Typically, somewhat more than half of this planned downtime is attributable to database backup, while the maintenance, upgrading and replacement of application and system software, hardware and networks accounts for most of the rest of the time in this category (Enterprise Management Associates, Inc., 2002).
Although much of this planned downtime occurs during “non business” hours, changes to the business environment brought about by globalisation, online ordering, back office consolidation, more flexible work practices and many other factors mean that the number of “non business” hours is gradually but inexorably being reduced, while the amount of data that needs to be protected is growing exponentially.
There are also problems other than downtime itself that stem from data protection systems dependent on manual configuration and management. Highly skilled personnel with multiple skill sets are required to manage, configure, and optimize the performance of large, distributed data protection infrastructures. Unfortunately, at the same time as these individuals are becoming more difficult to hire and retain, traditional data protection regimes are forcing them to perform most of their implementation and troubleshooting at a time when most people would prefer to be in their beds or with their families. While the professionalism of data protection specialists is typically high, the pressures of late nights, increasing workloads and decreasing resources often lead to increases staff turnover with the consequent rise in data loss caused by human error.
Why tape may be unsuitable for long term archiving
Other than the difficulty of expunging data which should no longer be kept, tape is a poor choice for long term archives for two other reasons. The first is that the Commonwealth and State Electronic Transactions Acts, which set out the legal requirements for electronic transactions and the archiving procedures for electronic records, state that data retention methods must allow for changes in technology. Tape has a poor track record in meeting this requirement. For example, data recorded on a “DLT III” tape cartridge, which was still widely used less than 6 years ago, cannot be read by any commercially available tape drive today.
Secondly, “For any information kept in electronic form, the method of generating the information must provide a reliable way of maintaining the integrity of the information, unless a specific storage device is provided for by the relevant legislation.” Tape is a relatively delicate contact media, which degrades with use, can become physically damaged and is adversely affected by swings in environmental conditions. Data stored on tape can also be lost from exposure to magnetic fields. Thus, in order to provide “a reliable way of maintaining the integrity of the information”, tapes must be periodically refreshed (read and rewritten). Managing refresh cycles for hundreds of tapes written over many years is a complex and extremely costly task, with potentially serious consequences if not managed properly.
Is tape really capable of keeping data for long periods of time?
Some tape media, such as LTO-4, are often touted as having a 30 year archival life. For a technology that is less than three years old, this kind of claim can only be relied upon through a fair degree of faith in the vendors’ statistical analysis techniques. IT and business management are asked to take this leap of faith while accepting that, unlike disk, there is little or no hard data published for tape on “mean time to data loss” or annual failure rates under various conditions. Given the high rates of dissatisfaction with tape based backup, vendor claims of long-term reliability may need to be reviewed with greater rigor.
Tape is only as good as its handlers
One of the major failings of tape is not the technology itself, but the way in which it is treated. In many cases the staff entrusted with tape management and movement are in entry level IT positions, or are semi skilled third party couriers. Even tape media manufacturers openly acknowledge that expected archival life times are only for tapes kept in “optimum” operating and storage conditions of 16°C to 25°C, relative humidity of 20% to 50%, and no shock or vibration, none of which apply to courier vans. In addition, tape drives must themselves be subject to rigorous preventative maintenance. The reason for this is that a tape used in a drive that was not well maintained, and that has accumulated dirt and debris from dirty heads, roller guides and other transport assemblies, may find this debris gets transferred to the tape media. When these dirty tapes are subsequently used in a good drive, they may transfer some or all of those contaminants and degrade a previously clean drive. As the new drive becomes contaminated, a variety of problems can result, including premature head wear, debris accumulation on critical parts of the drive transport, and then damage to the tape. This leads to an even larger media impact, as any new tapes that are used in the drive can also be damaged.
A further cautionary note when using a backup application for long-term archive is that the media is recorded in a proprietary logical format readable only by the originating application. Backup vendors have been known to discontinue backwards read compatibility for their own logical tape formats, and a change of backup vendors, or of products from the same vendor, may make recovery from archived tapes difficult, if not impossible. Thus, true long-term archive would also require archiving the entire backup system, including the computer, recording hardware and software, as well as multiple copies of the media.
While this post is really more about archiving than it is about backup, almost every backup environment I’ve ever come across is still used to store long term archives on tape media. Most of the time this works reasonably well, but far too often it doesn’t. To put this in perspective, if you needed the data on the tape to defend yourself in a legal battle, how would you feel about only having a “pretty good chance” of getting the information you needed?
How backup systems fail to satisfy regulatory requirements
If we then examine the typical backup retention policies against Australian regulatory requirements, we find that Sections 9 and 286 of the Corporations Act state that the following information needs to be kept:
Financial Records (invoices, receipts, orders for the payment of money, bills of exchange, cheques, promissory notes, vouchers and other documents of prime entry; and such working papers and other documents as are necessary to explain the methods and calculations by which accounts are made up) that correctly record and explain the transactions (including any transactions as trustee) and would enable true and fair financial statements to be prepared.
It is this legislation that drives the vast majority of the “7 year retention” requirement. This is then applied as a blanket policy across all data types, regardless of whether the data falls under the definition of financial records above. This often results in large amounts of data being kept with little or no business justification.
Unfortunately, even for data which does fall under the Act, the “keep monthly backups for 7 years” policy does not completely satisfy the above requirement. Take for example a spreadsheet meeting the definition of a “working paper” above, that was created on the 4th day of the month, used as the basis for a transaction on the 6th day of the month, and then inadvertently deleted or changed on the 9th day. There is no guarantee that a document of this type will appear in the monthly archives, as it was created after the previous month’s backup and destroyed before the current month’s backup takes place.
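The spreadsheet example can be checked with a few lines of code. This is only a sketch under stated assumptions: the dates are hypothetical, the monthly backup is assumed to run on the last day of each month, and a backup is assumed to capture a file only if it runs on or after the day the file was created and before the day it was deleted or changed.

```python
from datetime import date, timedelta

def captured_by_backup(created, destroyed, backup_dates):
    """True if at least one backup ran while the original file existed.

    Assumption: a backup on the creation day captures the file, and a
    backup on the day it was deleted or changed does not.
    """
    return any(created <= b < destroyed for b in backup_dates)

# Hypothetical month-end backups either side of the working paper's life.
monthly_backups = [date(2009, 2, 28), date(2009, 3, 31)]

created = date(2009, 3, 4)    # spreadsheet created on the 4th
destroyed = date(2009, 3, 9)  # deleted or changed on the 9th

in_monthly = captured_by_backup(created, destroyed, monthly_backups)

# For comparison: one recovery point every day in March.
daily_backups = [date(2009, 3, 1) + timedelta(days=d) for d in range(31)]
in_daily = captured_by_backup(created, destroyed, daily_backups)

print(in_monthly)  # False
print(in_daily)    # True
```

The version of the spreadsheet that supported the transaction on the 6th never makes it into the monthly archive, no matter how long that archive is kept.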
If this wasn’t bad enough, there are a number of other regulations that require data to be kept for a certain period of time after a specific event has passed. One example of this is the Workplace Relations Act, which requires pay slips to be kept for seven years after employment is terminated. In the case of an employee who has been working for five years, the “keep monthly backups for 7 years” retention policy would begin to cause potential non-compliance two years after the termination of that employee. As a final complication, the Privacy Act of 1988 states that an organization must take reasonable steps to destroy or permanently de-identify personal information if it is no longer needed; a legal requirement that may prove very difficult to comply with if tape backup is the primary method used for data archiving.
The reason traditional backup systems fail at regulatory compliance is that they were never designed for the task, nor was tape, the traditional backup media of choice.
Although there is considerable overlap in the functional requirements, backup is not the same as archive or disaster recovery. If people allowed a backup system to be just that, without overloading it with other non-core requirements, then it would have a good chance of meeting its data availability targets at a reasonable cost. However, while it tries to carry these additional burdens, it is beaten before it has even started the race.
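The pay slip arithmetic is easy to get wrong, so here is a small worked sketch. The function and its parameter names are hypothetical; time is counted in whole years from the employee's first pay slip, and the only inputs are the figures used in the paragraph above.

```python
def retention_gap(years_employed,
                  backup_retention_years=7,
                  years_required_after_termination=7):
    """Compare a fixed backup retention period with an event-based rule.

    Year 0 is the employee's first pay slip. A negative value for
    'gap_opens_years_after_termination' would mean the oldest backup
    expires before the employee has even left.
    """
    termination = years_employed
    # The backup holding the first pay slip ages out after the fixed period.
    oldest_backup_expires = backup_retention_years
    # The regulation counts from termination, not from creation.
    record_required_until = termination + years_required_after_termination
    return {
        "gap_opens_years_after_termination": oldest_backup_expires - termination,
        "shortfall_years": record_required_until - oldest_backup_expires,
    }

gap = retention_gap(years_employed=5)
print(gap)
# {'gap_opens_years_after_termination': 2, 'shortfall_years': 5}
```

For the five-year employee, the first pay slip drops out of the backups two years after termination, five years before the record is allowed to disappear.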