Disk Delving - 2 Good Papers and a Blog
“The Google team found that 36% of the failed drives did not exhibit a single SMART-monitored failure. They concluded that SMART data is almost useless for predicting the failure of a single drive.” – StorageMojo - Google’s Disk Failure Experience
Two excellent papers on disk drive failures have been released recently: the Dugg and Dotted Google paper, Failure trends in a large disk drive population (warning: PDF), and the equally excellent but less hyped Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
Both papers make very interesting reading (the comparisons of SCSI to SATA disks alone should turn some heads), but they are a little dry. Once you’ve worked your way through them, it’s worth looking at the summarised highlights over at StorageMojo, a top-notch blog that was recommended to me by Kim Hawtin. StorageMojo covered both papers, and I’ve linked to its posts in the quotes above and below.
“Further, these results validate the Google File System’s central redundancy concept: forget RAID, just replicate the data three times. If I’m an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.” – StorageMojo - Everything You Know About Disks Is Wrong