Some Thoughts on Bit Rot
During some recent discussions on Twitter, the subject of rebuild times for very large drives in excess of 10TB raised the subject of unrecoverable read errors (UER), which are sometimes blamed on something called “bit rot”. However, two NetApp-sponsored studies show that bit rot is far less of a problem for storage array reliability than many other factors.
The best publicly available data I’ve found on bit rot and its impact compared to other causes is contained in “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht, IEEE Transactions on Computers, Vol. 58, No. 3, March 2009 (http://media.netapp.com/documents/rp-0046.pdf). The following summarizes and paraphrases the information found in that document.
What bit rot is and why you should care
Bit rot is a concern for two main reasons. For the home user with no RAID protection, it results in the inconvenience of a lost or corrupted file, or possibly a machine that won’t boot. For the enterprise user, bit rot raises the specter not just of a lost or corrupted file, but of losing an entire RAID group after the failure of a single drive, due to the “Media Error on Data Reconstruct” problem. The less catastrophic issue is far less likely on an enterprise-class array, because the additional error detection and correction available through RAID and block-level checksums means the chance of bit rot causing the loss or corruption of a file is vanishingly remote.
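To see why the “Media Error on Data Reconstruct” problem worries people as drives grow, here is a back-of-the-envelope sketch of the chance of hitting at least one unrecoverable read error while reading an entire drive, as a rebuild must. The 1-in-10^15 error rate is a typical datasheet figure used purely as an assumption, and bit errors are treated as independent, which real drives are not:

```python
# Sketch: probability of >= 1 unrecoverable read error (URE) while
# reading every bit of a drive once, e.g. during a RAID rebuild.
# Assumes an illustrative URE rate of 1 error per 1e15 bits read and
# statistically independent bit errors (both simplifications).

def p_ure_during_full_read(capacity_tb: float, ure_rate: float = 1e-15) -> float:
    """Probability of at least one URE over a full sequential read."""
    bits = capacity_tb * 1e12 * 8        # decimal TB -> bits
    p_clean = (1.0 - ure_rate) ** bits   # every bit reads back correctly
    return 1.0 - p_clean

for tb in (1, 4, 10):
    print(f"{tb:>2} TB full read: {p_ure_during_full_read(tb):.1%} chance of a URE")
```

Under these assumptions the risk grows from under 1% for a 1TB drive to several percent for a 10TB drive, which is why single-parity rebuilds of very large drives make people nervous.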
What I believe most people mean by bit rot could be more accurately described as latent media errors, rather than “bit rot” in the strict sense, which is caused by degradation of the magnetic properties of the media.
The distinction matters because most early RAID reliability models assumed that, once written, data would remain intact except for “bit rot”. Although it is correct that the magnetic properties of the media can degrade, this failure mechanism is not a significant cause of data loss. Data can become corrupted any time the disks are spinning, even when data are not being written to or read from the disk. The failure mechanisms outlined below are not unknown, but neither are they readily available from HDD manufacturers.
Common causes of losing data
Four common causes of losing data after it has been correctly written are thermal asperities, scratches, smears, and corrosion.
- Thermal asperities are instances of high heat for short durations caused by head-disk contact. This is usually the result of heads hitting small “bumps” created by particles embedded in the media surface during the manufacturing process. The heat generated by a single contact may not be sufficient to thermally erase data, but may be after many contacts.
- Although disk heads are designed to push particles away, contaminants can still become lodged between the head and disk. Hard particles used in the manufacture of an HDD can cause surface scratches and data erasure any time the disk is rotating.
- Other “soft” materials, such as stainless steel from assembly tooling, tend to smear across the surface of the media, rendering the data unreadable.
- Corrosion, although carefully controlled, can also cause data erasure and may be accelerated by the heat generated by thermal asperities.
Why data is sometimes not there in the first place
A latent defect can also be caused by data that was incorrectly or incompletely written to the disk in the first place. This can happen because of the inherent bit error rate (BER), writing to damaged media, or too much lubrication causing “high-fly writes”.
- The bit error rate (BER) is a statistical measure of the effectiveness of all the electrical, mechanical, magnetic, and firmware control systems working together to write (or read) data. Most bit errors occur on a read command and are corrected, but bit errors can also occur during writes, and since written data are rarely checked immediately after writing, these go undetected.
- BER accounts for a fraction of defective data written to the HDD, but a greater source of errors is the magnetic recording media that coats the disks. Writing on scratched, smeared, or pitted media can result in corrupted data. The reasons for scratches and smears were covered earlier; pits and voids are caused by particles that were originally embedded in the media during the manufacturing process and subsequently dislodged during the polishing process or field use.
- The final common cause of poorly written data is the “high-fly write”. The heads are aerodynamically designed to maintain a small, fixed distance above the disk surface at all times. If the aerodynamics are disturbed, the head can fly too high, resulting in weakly (magnetically) written data that cannot be read. In addition to “wind gusts” inside the disk, all disks have a very thin film of lubricant on them to help protect against head-disk contact. While this lubrication helps mitigate the effects of thermal asperities, lubricant build-up on the head can increase the flying height, resulting in weak or incomplete writes.
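Because BER is a rate per bits transferred, the expected number of bad bits scales linearly with how much data a drive writes over its life. A minimal sketch, assuming a purely hypothetical write-side error rate of 1 in 10^15 bits and independent errors:

```python
# Sketch: expected number of bit errors introduced while writing a
# given volume of data, under an assumed (hypothetical) bit error
# rate. Real drives quote BER per bits *read*; the write-side figure
# here is an illustrative assumption only.

def expected_bit_errors(bytes_written: float, ber: float = 1e-15) -> float:
    """Expected bad bits = total bits written * per-bit error rate."""
    return bytes_written * 8 * ber

# Writing 100 TB over a drive's service life at this assumed rate:
print(expected_bit_errors(100e12))   # roughly 0.8 expected bad bits
```

Even a fraction of one expected bad bit per drive matters at array scale: multiply by hundreds of drives and years of service, and undetected write errors become latent defects waiting to be found during a rebuild.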
Where’s my data?
Finally, all the data may have been written correctly, but the disk may not be able to “find” it because of damage to special “servo” tracks, which keep the heads correctly aligned to the data on the disk. In other cases, it’s not damage to the servo tracks: wear and tear on the motor and disk head bearings, noise, vibration, and other electromechanical errors can cause the head positioning to take too long to lock onto a track, which ultimately also causes latent block errors.
How to protect yourself
There are two main ways of dealing with these kinds of latent block errors. The first is to perform disk scrubs, which is something every reputable array vendor does; the problem is that as disk sizes get larger and larger, a full disk scrub can take too long for the protection to be as effective as it should be. The other method is to use additional levels of RAID protection, such as RAID-6, which allows for higher levels of resiliency and error correction in the event of hitting a latent block error when reconstructing a RAID set. NetApp uses both approaches, as studies have shown that the risk of losing data through these kinds of events is thousands of times higher than predicted by most simple “MTBF” failure models.
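The scrub-duration problem is easy to illustrate. A scrub must read every sector, and scrubs are usually throttled so they don’t steal bandwidth from real workloads. A rough sketch, where the 50 MB/s background rate is an assumption for illustration rather than any vendor’s actual throttle setting:

```python
# Sketch: hours needed for a full sequential scrub of a drive at an
# assumed, throttled background read rate. The throughput figure is
# an illustrative assumption, not a vendor specification.

def scrub_hours(capacity_tb: float, scrub_mb_per_s: float) -> float:
    """Hours to read every sector once at the given sustained rate."""
    megabytes = capacity_tb * 1e6        # decimal TB -> MB
    return megabytes / scrub_mb_per_s / 3600.0

for tb in (2, 10, 20):
    print(f"{tb:>2} TB @ 50 MB/s background scrub: {scrub_hours(tb, 50):6.1f} h")
```

Scrub time grows linearly with capacity while per-drive throughput grows far more slowly, so the window during which a latent error can sit undetected keeps widening with each drive generation; this is exactly the gap that a second parity drive (RAID-6 style) covers.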