Hard Drive Failure

I lost a drive holding data that I did not have a full backup of. It sucks, but the data was not critical.

There were a couple of lessons learned. First, the root cause: I think a fan died. I have not opened the case up yet, but I suspect the dead fan let the drives overheat, which caused them to fail.

So I have figured out how to monitor the drive temps. I am using a utility called hddtemp. It produces nice, simple output on drive temps, and it also has a daemon mode that a Nagios plugin can talk to, so I get an alert if temps go too high.
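Plugins like this generally connect to the hddtemp daemon over TCP (port 7634 by default) and read the single report it sends, which strings together `|device|model|temperature|unit|` records. Below is a minimal sketch of reading and parsing that report in Python, assuming the daemon runs on the local machine at the default port; `query_hddtemp` and `parse_hddtemp` are hypothetical names, not part of hddtemp or any Nagios plugin.

```python
import socket


def query_hddtemp(host="127.0.0.1", port=7634, timeout=5.0):
    """Read the raw report the hddtemp daemon sends when a client connects.

    Assumption: the daemon listens on localhost:7634 (hddtemp's default port).
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # the daemon closes the connection after one report
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def parse_hddtemp(raw):
    """Turn '|dev|model|temp|unit||dev|model|temp|unit|' into a list of dicts."""
    drives = []
    # Records are glued together as '...|unit||dev|...', so split on '||'.
    for record in raw.strip().strip("|").split("||"):
        fields = record.split("|")
        if len(fields) != 4:
            continue  # skip anything that does not look like a drive record
        device, model, temp, unit = fields
        drives.append({
            "device": device,
            "model": model,
            # Non-numeric values (e.g. a sleeping or unsupported drive) become None.
            "temp": int(temp) if temp.isdigit() else None,
            "unit": unit,
        })
    return drives
```

Against a live daemon, `print(parse_hddtemp(query_hddtemp()))` should print one dict per drive, which is the shape a threshold check needs.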

I also set up smartd to let me know if a drive is failing.
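smartd does its checking in the background, but for a one-off check by hand the `smartctl` tool from the same smartmontools package prints an overall health verdict with `-H`. Here is a minimal sketch that shells out to it and reads the ATA-style verdict line, assuming smartctl is installed and the script can read the device (usually that means root). `parse_health` and `check_drive` are hypothetical names, and SCSI drives word the verdict differently (`SMART Health Status: OK`), so they would come back as `None` here.

```python
import subprocess


def parse_health(output):
    """Return True/False from `smartctl -H` text, or None if no verdict line is found.

    Assumption: the drive reports the ATA-style line
    'SMART overall-health self-assessment test result: PASSED'.
    """
    for line in output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.rsplit(":", 1)[1].strip() == "PASSED"
    return None


def check_drive(device):
    # smartctl's exit status is a bit mask, so read the verdict from the text instead.
    result = subprocess.run(["smartctl", "-H", device], capture_output=True, text=True)
    return parse_health(result.stdout)
```

Calling `check_drive("/dev/sda")` (the device path is only an example) should return `True` for a drive that passes its self-assessment.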

Another thing I wanted to know was exactly what was on those drives, so I used find to walk through them and report on each file with file and stat.
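The same kind of inventory can be sketched in Python, which makes it easy to keep the results as a list: `os.walk` stands in for find, `os.stat` for stat, and the `file` command is called on each entry when it is installed. This is a minimal sketch, not the exact commands I ran; `describe_tree` is a hypothetical name.

```python
import os
import shutil
import subprocess
import time


def describe_tree(root):
    """Yield (path, size_bytes, mtime, file_type) for every non-directory entry under root."""
    has_file_cmd = shutil.which("file") is not None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Stat the entry itself rather than a symlink target, like find without -L.
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # unreadable entries are common on a failing disk
            if has_file_cmd:
                out = subprocess.run(["file", "--brief", path],
                                     capture_output=True, text=True)
                ftype = out.stdout.strip()
            else:
                ftype = "unknown"  # `file` is not installed on this machine
            mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
            yield path, st.st_size, mtime, ftype
```

For example, `for rec in describe_tree("/mnt/olddrive"): print(*rec, sep="\t")` would print a tab-separated listing; the mount point is an assumption.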

I did replace the drive and I am watching temps. I have a large fan cooling the system down for now.

At 9:57PM one drive was at 59C; after the fan had been on it for about 15 minutes, the temp dropped to 42C. I have the check set to warn if temps go above 35C and to go critical above 48C.
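Those readings work out to a drop of 17C in roughly 15 minutes (about 1.1C per minute), and with these thresholds even the cooled-down 42C reading still sits in the warning band. A minimal sketch of the threshold logic, using the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL); `nagios_status` is a hypothetical name, and treating "above" as strictly greater than is an assumption.

```python
WARN_C = 35  # warning threshold from this post
CRIT_C = 48  # critical threshold from this post


def nagios_status(temp_c, warn=WARN_C, crit=CRIT_C):
    """Map a temperature in C to a Nagios (label, exit code) pair.

    Assumption: "above" means strictly greater than the threshold.
    """
    if temp_c > crit:
        return "CRITICAL", 2
    if temp_c > warn:
        return "WARNING", 1
    return "OK", 0


for reading in (59, 42):
    print(reading, *nagios_status(reading))
# 59 CRITICAL 2
# 42 WARNING 1
```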

Let's see if temps stay down.

Weight: 314,4

This entry was posted in Technical, Training, Weigh In.