Don't Drop Things

I had an interesting event on my home NAS box last night. I woke up to notification emails from both smartd and my custom ZFS monitoring script. My NAS primarily stores backups of all of my photography work - over 2 TB of photos are on it, so it's pretty important to me.

So here's the build:

  • Four 2 TB enterprise grade Seagate drives for data, plus a single SSD for the OS, which is Ubuntu 14.04.
  • The Seagates are configured as a single RAID-Z pool (single parity, roughly equivalent to RAID-5).
  • The Seagates are configured with hdparm to spin down after 180 minutes of inactivity. The downside is that when I stream my first video of the night I have to wait for them to spin up.
  • I wrote a script that checks the status of the ZFS pool every 15 minutes and emails me if there's a problem.
  • Smartd is also monitoring hard drive health.
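
For anyone who wants something similar, here is a minimal sketch of that kind of pool check. It is not my actual script. It assumes the `zpool` command is on the PATH, that a mail server is listening on localhost, and that the addresses and script path shown are placeholders. It relies on `zpool status -x` printing "all pools are healthy" when nothing is wrong.

```python
import smtplib
import subprocess
import sys
from email.message import EmailMessage

# Message printed by `zpool status -x` when every pool is fine.
HEALTHY_MESSAGE = "all pools are healthy"

# Placeholder addresses (assumptions, not from the original setup).
SENDER = "nas@example.com"
RECIPIENT = "me@example.com"


def read_pool_status():
    """Run `zpool status -x` and return its combined output as text."""
    result = subprocess.run(
        ["zpool", "status", "-x"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def pool_problem(status_output):
    """Return None if the pools are healthy, else the status text."""
    text = status_output.strip()
    if text == HEALTHY_MESSAGE:
        return None
    return text


def build_alert(problem, sender, recipient):
    """Build the notification email for a reported pool problem."""
    msg = EmailMessage()
    msg["Subject"] = "ZFS pool problem on NAS"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(problem)
    return msg


def main():
    problem = pool_problem(read_pool_status())
    if problem is not None:
        # Assumes a local mail server on the default SMTP port.
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(build_alert(problem, SENDER, RECIPIENT))


# Only run the real check when asked to, e.g. from cron.
if __name__ == "__main__" and sys.argv[1:] == ["--check"]:
    main()
```

A cron entry along the lines of `*/15 * * * * /usr/bin/python3 /usr/local/bin/zfs_check.py --check` (hypothetical path) would run it every 15 minutes.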

Here's the event:

Last night I decided to watch a movie from my NAS drive that I'd ripped years ago. I fell asleep watching the movie, which streamed perfectly.

I woke up this morning to find that 2 hours BEFORE I'd started streaming it, smartd had notified me that one of the four Seagates wasn't there anymore. /dev/sdc had completely vanished.

My ZFS monitoring script noticed at about the time I started streaming, and emailed me to let me know the pool was degraded.

I rebooted the box and smartd immediately said all was well. All mount points were back. ZFS reported that it had recovered but still had the errors logged. It also reported that it had resilvered a little over 5 MB of data that had changed while the disk was missing. I cleared the errors and the pool shows healthy now.

Here's the takeaway:

First, I love ZFS. The fact that I was still watching movies from the array without difficulty is awesome.

Second, I'm glad I had monitoring going. It would have been a long time before I noticed if I hadn't set up the email notifications.

Third, don't drop anything on your NAS server. Looking at the notification timestamps, I realized that the failure occurred at the same time I accidentally knocked something off my desk onto the NAS server. I should probably pop the side off and make sure none of the cables are loose.

Posted by Tony on Feb 06, 2016 | ZFS, Servers