
Why Archive?


During a discussion at the SNIA blogfest, I mentioned that I'd written a whitepaper on archiving and promised that I'd send it on. It took me a while to get around to this, but I finally dug it out of my own archive, which is implemented in much the same way as the example policy at the bottom of this post. Once I started looking, it took only about a minute to search for and retrieve the document from a FAS array far enough away to incur 60 ms of RTT latency. Overall I was really happy with the result.

The document was in my archive because I wrote it almost two years ago. A number of things have changed since then, but the fundamental principles have not; I'll work on updating it when things are less busy, probably sometime around January '11. On a final note, because I wrote this a couple of years ago when my job role was different than it is today, the document is considerably more "salesy" than my usual blog posts; it shouldn't be construed as a change in direction for this blog in general.


There are a number of approaches that can be broadly classified as some form of archiving, including Hierarchical Storage Management (HSM) and Information Lifecycle Management (ILM). All of these approaches aim to improve the IT environment by:

  • Lowering Overall Storage Costs
  • Reducing the backup load and improving restore times
  • Improving Application Performance
  • Making it easier to find and classify data

The following kinds of claims are common in the marketing material promoted by vendors of archiving software and hardware:

“By using Single Instance Storage, data compression and an ATA-based archival storage system (as opposed to a high performance, Fibre Channel device), the customer was able to reduce storage costs by $160,000 per terabyte of messages during the first three years that the joint EMC / Symantec solution was deployed. These cost savings were just the beginning, as the customers were also able to maintain their current management headcount despite a 20% data growth and the time it took to restore messages was drastically reduced. By archiving the messages and files, the customer was also able to improve electronic discovery retrieval times as all content is searchable by keywords.”

These kinds of results, while impressive, assume a number of things that are not true in NetApp implementations:

  • A price difference between primary storage and archive storage of over $US160,000 per TB.
  • Backup and restores are performed from tape using bulk data movement methods
  • Modest increases in storage capacities require additional headcount

In many NetApp environments, the price difference between the most expensive tier of storage and the least expensive simply does not justify the expense and complexity of implementing an archiving system based on a cost per TB alone.

For file serving environments, many file shares can be stored effectively on what would traditionally be thought of as "Tier-3" storage: high-density SATA drives, RAID-6 and compression/deduplication. This is because unique NetApp technologies such as WAFL and RAID-DP provide the performance and reliability required for many file serving environments. In addition, using NetApp SnapVault replication-based data protection for backup and long-term retention means that full backups are no longer necessary. The presence or absence of the kinds of static data typically moved into archives has little or no impact on the time it takes to perform backups, or to make data available in the case of disaster.

Finally, the price per GB and per IOPS of NetApp storage has fallen consistently, in line with the trend in the industry as a whole. Customers can lower their storage costs by purchasing and implementing storage only as required. A NetApp FAS array's ability to non-disruptively add new storage, or to move excess capacity and I/O from one volume to another within an aggregate, makes this approach both easy and practical.

While the benefits of archiving for NetApp based file serving environments may be marginal, archiving still has significant advantages for email environments, particularly Microsoft Exchange. The reasons for this are as follows:

  1. Email is cache "unfriendly" and generally needs many dedicated disk spindles for adequate performance.
  2. Email messages are not modified after they have been sent or received.
  3. There is a considerable amount of "noise" in email traffic (spam, jokes, social banter etc).
  4. Small email stores are easier to cache, which can significantly improve performance and reduce the hardware requirements for both the email servers and the underlying storage.
  5. Email is more likely to be requested during legal discovery.
  6. Enterprises now consider email to be a mission-critical application, and some companies still mandate a tape backup of their email environments for compliance purposes.

Choosing the right Archive Storage

It’s about the application

EMC and NetApp take very different approaches to archive storage, each of which works well in a large number of environments. An excellent discussion on the details of this can be found in the NetApp whitepaper WP-7055-1008 Architectural Considerations for Archive and Compliance Solutions. For most people however, the entire process of archive is driven not at the storage layer, but by the archive applications. These applications do an excellent job of making the underlying functionality of the storage system transparent to the end user, however the user is still exposed to the performance and reliability of the storage underlying the archives.

Speed makes a difference

Centera was designed to be “Faster than Optical” and while it has surpassed this relatively low bar, its performance doesn’t come close to even the slowest NetApp array. This is important, because the amount of data that can be pushed onto the archive layer is determined not just by IT policy, but also by user acceptance and satisfaction with the overall solution. The greater the user acceptance, the more aggressive the archiving can be, which results in lower TCO and faster ROI.

Protecting the Archive

While the archive storage layer needs to be reliable, it should be noted that without the archive application and its associated indexes the data is completely inaccessible, and may as well be lost. While it might be possible to rebuild the indexes and application data from the information in the archive alone, this process is often unacceptably long. Protecting the archive therefore involves protecting the archive data store, the full-text indexes, and the associated databases in a consistent manner at a single point in time.

Migrating from an Existing Solution

Many companies already have archiving solutions in place, but would like to change their underlying storage system to something faster and more reliable. Fortunately, archiving applications build the capability to migrate data from one kind of back-end storage to another into their software. The following diagrams show how this can be achieved with EmailXtender and DiskXtender to move data from Centera to NetApp.

Some organizations would prefer to completely replace their existing archiving solutions including hardware and software. For these customers NetApp collaborates with organizations such as Procedo (www.procedo.com), to make this process fast and painless.


As mentioned previously, the cost and complexity of traditional archiving infrastructure may not add sufficient value to a NetApp file-serving environment, as many of the problems it solves are already addressed by core NetApp features. This does not mean that some form of storage tiering could not or should not be implemented on FAS to reduce the amount of NetApp primary capacity.

One easy way of doing this is to take advantage of the flexibility of the built-in backup technology. This is an extension of the "archiving" policy used by many customers, where the backup system is used for archive as well. While the approach of mixing backup and archive is rightly discouraged by most storage management professionals, the reasons for discouraging it in traditional tape-based backup environments don't apply here.

The reasons for this are:

  • Snapshot- and replication-based backups are not affected by capacity, as only changed blocks are ever moved or stored
  • The backups are immediately available, and can be used for multiple purposes
  • Backups are stored on high-reliability disk in a space-efficient manner using both non-duplication and de-duplication techniques
  • Files can be easily found via existing user interfaces such as Windows Explorer or external search engines

In general, SnapVault destinations use the highest-density SATA drives with the most aggressive space-savings policies applied to them. These policies and techniques, which may not be suitable for high-performance file sharing environments, provide the lowest cost per TB of any NetApp offering. This, combined with the ability to place the SnapVault destination in a remote datacenter, may relieve the power, space and cooling requirements of increasingly crowded datacenters.
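As a concrete sketch of such a space-savings policy, deduplication can be enabled on the SnapVault destination volume. The commands below are Data ONTAP 7-mode syntax written from memory; the volume name and schedule are illustrative assumptions, so check the command reference for your release before using them.

```
# Hypothetical 7-mode sketch; the volume name and schedule are illustrative.
sis on /vol/sv_dest                    # enable deduplication on the volume
sis config -s sun-sat@2 /vol/sv_dest   # run the dedupe scan nightly at 02:00
sis start -s /vol/sv_dest              # scan existing data once so it is deduplicated too
```

Running the scan overnight keeps the post-transfer deduplication work away from the backup window itself.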

An example policy

Many companies' file archiving requirements are straightforward, and do not justify the detailed capabilities provided by archiving applications. For example, a company might implement the following backup and archive policy:

  • All files are backed up on a daily basis, with daily recovery points kept for 14 days, weekly recovery points kept for two months, and monthly recovery points kept for seven years.
  • Any file that has not been accessed in the last sixty days will be removed from primary storage and will need to be accessed from the archive

This is easily addressed in a SnapVault environment through the use of the following:

  • Daily backups are transferred from the primary system to the SnapVault repository
  • Daily recovery points (snapshots) are kept on both the primary storage system and the SnapVault repository for 14 days
  • Weekly recovery points (snapshots) are kept only on the SnapVault repository
  • Monthly recovery points (snapshots) are kept only on the SnapVault repository
  • A simple shell script or batch file is executed after each successful daily backup, deleting any file from the primary volume that has not been accessed in the last sixty days
  • Users are allocated a drive mapping to their replicated directories on the SnapVault destination.
  • Optionally, the primary systems and SnapVault repository may be indexed by an application such as the Kazeon IS1200 or Google enterprise search.
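The retention bullets above might translate into Data ONTAP 7-mode commands roughly as follows. This is a sketch from memory: the volume, qtree and snapshot names are invented for illustration, and long monthly retention was often driven by an external scheduler rather than the built-in schedule, so consult the SnapVault documentation for the exact syntax on your release.

```
# Hypothetical 7-mode sketch; all names are illustrative.
# Secondary: establish the baseline relationship for a qtree.
snapvault start -S primary:/vol/users/q_home /vol/sv_dest/q_home

# Primary: create and retain 14 daily snapshots locally.
snapvault snap sched users sv_daily 14@23

# Secondary: update from the primary, then retain daily and weekly points.
snapvault snap sched -x sv_dest sv_daily 14@23
snapvault snap sched -x sv_dest sv_weekly 8@sun@23

# Monthly points kept for seven years (84 snapshots) would typically be
# driven by an external scheduler or script rather than a built-in schedule.
```

The `-x` flag on the secondary schedules a transfer from the primary before the snapshot is taken, which is what turns each retained snapshot into a usable recovery point.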

Users then need to be informed that old files will be deleted after sixty days, and that they can access backups of their data, including files that have been deleted from primary storage, by looking through the drive mapped to the SnapVault repository, or optionally via the enterprise search engine's user access tools.

By removing the files from primary storage, instead of using the traditional "stub" approach favoured by many archive vendors, the overall performance of the system is improved by reducing the metadata load, and users can more easily find active files because there are fewer files and directories on the primary systems.
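The grooming step described above could be sketched as follows. This is a minimal illustration, not a supported tool: the function name and paths are invented, the threshold is a parameter, and it assumes the primary volume is mounted where the script can see it and that the filesystem records access times (i.e. it is not mounted `noatime`).

```shell
#!/bin/sh
# Hypothetical grooming sketch: after each successful SnapVault backup,
# remove files on the primary volume whose last access is older than a
# given threshold. Every file removed already exists in the SnapVault
# destination's recovery points.
prune_old_files() {
    root="$1"   # mount point of the primary volume, e.g. /mnt/users
    days="$2"   # access-time threshold in days, e.g. 60
    # -atime +N matches files last read more than N days ago;
    # -print logs each match, -delete then removes it.
    find "$root" -type f -atime +"$days" -print -delete
}
```

In practice the backup scheduler would invoke this only after the SnapVault transfer reports success, so that nothing is deleted from primary storage before it is safely protected.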


Many organisations' archiving requirements can be met by simply adding additional SATA disk to the current production system, replicated via SnapMirror to the current DR system, rather than managing separate archive platforms.

This architecture provides flexibility and scalability over time and reduces management overhead. Tape can also be used for additional backup and longer-term storage if required. SnapLock, a separately licensable software feature, provides the non-modifiable WORM-like capability required of an archive without additional hardware (see http://www.netapp.com/us/products/protection-software/snaplock.html).

Categories: Archive, Data Protection, Value
  1. March 19, 2011 at 1:15 am

    I was with you on the snapvault stuff until
    **A simple shell script/batch file is executed after each successful daily backup which deletes any file from the primary volume that has not been accessed in thirty days
    **Users are allocated a drive mapping to their replicated directories on the SnapVault destination

    To me shell script/batch files and the users needing to use another mapped drive to access old files, doesn’t feel enterprise ready.

    The stub files may impact the performance, but they don’t change the way a user works, which for me is very important.

    • March 21, 2011 at 3:21 pm

      Thanks for stopping by, and more importantly, thanks for the comment.

      I think I understand your feelings on this; the mere mention of scripts is enough to set off alarm bells for most. The basis of the example policy was something I wrote for a customer who had recently been through a long exercise in trying to make a stub-based HSM system pay off, and whose requirements were remarkably simple. For them, adding another five or so lines onto their existing logon/logoff scripts was a trivial exercise, and the clear separation this solution created in users' minds between current operational data and older, inactive data paid off in other ways too. Just because a user can keep adding data indefinitely into a drive share without any kind of data grooming doesn't mean that it's good policy, and in this particular IT manager's view, the existing stub-based solution encouraged an unreasonable and unsustainable view of data management on the end users' behalf.

      In the process of writing this response, I've realized that I should add some other example policies that exploit the strengths of this approach, including potentially integrating with cloud-based content repositories and other "big data" approaches. My strong feeling is that solving the long-term problems of data management will require process changes at the end-user level, or possibly embedding a lot of this intelligence at the content creation layer. Imagine if Acrobat or Word could automatically infer the retention and digital rights management policies for a document based on things like the author, keywords, review chains and so on, and communicate this to intelligent document repositories using open APIs like CDMI.

      Once again, thanks for the comments, when I find the time it will improve this post, and the work I do in the future.


