Backup Is Evil – Part 5 – The real threat
Causes of Data Loss
While outages caused by “acts of God”, terrorist attacks, and utility failures garner significant press coverage, the more mundane “day to day” causes of data loss go unreported and generally unnoticed. Furthermore, a quick Google search on the term “causes of data loss” turns up far more results on what could more accurately be described as “illegal data access”, driven by legislation in the US that mandates public reporting of this class of information security failure. As NetApp founder Dave Hitz states in his blog, this kind of reporting “baffles our risk intuition” and results in significant resources being dedicated to solving problems that may never occur.
As a case in point, a 2006 survey of over 260 IT professionals found that the leading causes of unexpected database downtime over the preceding year were infrastructure issues, followed by software and database glitches. While no root cause analysis was reported for the infrastructure outages, one of the interesting findings from this report was that while most business continuity plans brace for external events beyond the direct control of the organization, few, if any, companies in the survey said such events contributed to database downtime.
So if your real problems aren’t likely to be random meteorites, shadowy Eastern Bloc crime syndicates, or crazed freedom fighters, then what are the things that cause data to become unavailable?
One often-cited guide is the following set of figures from Kroll Ontrack on the causes of data loss:
|Cause of data loss||Perception||Reality|
|Hardware or system problem||78%||56%|
|Software corruption or problem||7%||9%|
While the raw data and research methodology behind these figures are unavailable, it seems reasonable to assume that the data came from their own customer base. Kroll Ontrack’s data recovery services are heavily oriented towards workstation and laptop users, whose data rarely benefits from RAID-protected disk subsystems. As a result, it is reasonable to assume that the percentage of failures caused by hardware and system problems would be far smaller in enterprise-class datacentres, though the added complexity may result in a corresponding increase in the percentage for user error, or as Steve Chambers put it so succinctly in his blog, a decrease in the mean time before cockup (MTBC).
A more thorough search of available literature and IT news coverage supports this view, with anecdotal evidence pointing to human error as the largest cause of data loss in enterprise environments.
Without a way of rapidly creating multiple recovery points throughout the day, the single largest cause of data loss is not really addressed, especially for those people who demand really small recovery point objectives. While synchronous mirroring might satisfy a zero RPO for a very small class of data loss events, recovery from the majority of failures depends on the old-school “bulk copy” methods implemented by 99% of data protection specialists, methods that typically have a recovery point objective of between 12 and 24 hours and a recovery time objective that can be equally scary.
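To put some rough numbers on the recovery point argument, here is a minimal sketch in Python. It assumes the data lost in a user error is simply everything written since the most recent recovery point. The function names, the dates, and the 17:45 failure time are all made up for illustration; they do not come from any vendor tool or from the surveys above.

```python
from datetime import datetime, timedelta


def recovery_points(start, end, interval):
    """Return recovery-point timestamps taken every `interval` from start to end."""
    points = []
    t = start
    while t <= end:
        points.append(t)
        t += interval
    return points


def data_loss_window(points, failure_time):
    """Return the time between the latest earlier recovery point and the failure."""
    earlier = [p for p in points if p <= failure_time]
    if not earlier:
        raise ValueError("no recovery point precedes the failure")
    return failure_time - max(earlier)


# Hypothetical day and failure time, chosen only for illustration.
day_start = datetime(2024, 1, 1, 0, 0)
day_end = day_start + timedelta(days=1)
failure = datetime(2024, 1, 1, 17, 45)  # a user error late in the working day

# One nightly "bulk copy" versus a recovery point every hour.
nightly = recovery_points(day_start, day_end, timedelta(hours=24))
hourly = recovery_points(day_start, day_end, timedelta(hours=1))

nightly_loss = data_loss_window(nightly, failure)
hourly_loss = data_loss_window(hourly, failure)

print("nightly copy loses:", nightly_loss)
print("hourly recovery points lose:", hourly_loss)
```

Under these assumptions the nightly copy loses most of a working day, while hourly recovery points lose less than an hour. The worst case for either scheme is one full interval between recovery points, which is why a 24 hour cycle cannot meet a small RPO no matter how reliable the copy itself is.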