The day my home server's RAID failed

A failure notification email arrived from the RAID array on the home server I built a while ago.


Subject: mdadm monitoring

This is an automatically generated mail message from mdadm
running on localhost.localdomain

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sdc1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md0 : active raid1 sdc1[2](F) sdb1[0]
     976759936 blocks [2/1] [U_]

unused devices: <none>
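The mdstat excerpt in the mail already identifies the failed member: the (F) after sdc1 marks it as faulty, [2/1] means only one of the two expected members is active, and [U_] shows that the second slot is down. As a rough sketch of how that could be checked automatically, the following parses the same text. The function names `parse_mdstat` and `is_degraded` are made up for this illustration, and the parser only handles the simple layout shown in the mail.

```python
import re

# Sample copied from the notification mail; on a live system you would
# read the text from /proc/mdstat instead.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdc1[2](F) sdb1[0]
      976759936 blocks [2/1] [U_]
"""


def parse_mdstat(text):
    """Summarise each md array: failed members and expected/active counts."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"(md\d+)\s*:\s*(.*)", line)
        if header:
            current = header.group(1)
            # Members look like "sdc1[2](F)"; the (F) suffix marks a faulty one.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            arrays[current] = {"failed": failed, "expected": None,
                               "active": None, "slots": None}
            continue
        # Status line, e.g. "[2/1] [U_]": expected/active counts and slot map.
        counts = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if counts and current:
            info = arrays[current]
            info["expected"] = int(counts.group(1))
            info["active"] = int(counts.group(2))
            info["slots"] = counts.group(3)
    return arrays


def is_degraded(info):
    """True if a member is flagged faulty or fewer members are active than expected."""
    if info["failed"]:
        return True
    return info["active"] is not None and info["active"] < info["expected"]


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE_MDSTAT).items():
        state = "DEGRADED" if is_degraded(info) else "ok"
        print(name, state, info)
```

Run against the sample, this reports md0 as degraded with sdc1 as the failed member, which matches what the mail says.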

I checked /var/log/messages, and these look like the relevant errors:

Mar 18 05:43:04 localhost kernel: sdc: Current [descriptor]: sense key: Medium Error
Mar 18 05:43:04 localhost kernel:     Add. Sense: Unrecovered read error - auto reallocate failed
Mar 18 05:43:04 localhost kernel:
Mar 18 05:43:04 localhost kernel: Descriptor sense data with sense descriptors (in hex):
Mar 18 05:43:04 localhost kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Mar 18 05:43:04 localhost kernel:         09 c3 05 61
Mar 18 05:43:04 localhost kernel: end_request: I/O error, dev sdc, sector 163775841
Mar 18 05:43:04 localhost kernel: ata2: EH complete
Mar 18 05:43:04 localhost kernel: SCSI device sdc: 1953525168 512-byte hdwr sectors (1000205 MB)
Mar 18 05:43:11 localhost kernel: sdc: Write Protect is off
Mar 18 05:43:12 localhost kernel: SCSI device sdc: drive cache: write back
Mar 18 05:44:12 localhost kernel: ata2.01: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Mar 18 05:44:12 localhost kernel: ata2: SError: { UnrecovData Handshk }
Mar 18 05:44:12 localhost kernel: ata2.01: cmd 35/00:00:3f:05:c3/00:04:09:00:00/f0 tag 0 dma 524288 out
Mar 18 05:44:12 localhost kernel:          res 40/00:cf:61:05:c3/40:03:09:00:00/f0 Emask 0x14 (ATA bus error)
Mar 18 05:44:12 localhost kernel: ata2.01: status: { DRDY }
Mar 18 05:44:23 localhost kernel: ata2: hard resetting link
Mar 18 05:44:24 localhost kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 18 05:44:24 localhost kernel: ata2.01: revalidation failed (errno=-2)
Mar 18 05:44:24 localhost kernel: ata2.01: disabled
Mar 18 05:44:25 localhost kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x40)
Mar 18 05:44:25 localhost kernel: ata2.00: revalidation failed (errno=-5)
Mar 18 05:44:25 localhost kernel: ata2: failed to recover some devices, retrying in 5 secs
Mar 18 05:44:30 localhost kernel: ata2: hard resetting link
Mar 18 05:44:30 localhost kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 18 05:44:30 localhost kernel: ata2.00: configured for UDMA/133
Mar 18 05:44:30 localhost kernel: ata2: EH complete
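The hex dump in the log and the "sector 163775841" in the end_request line describe the same location. As a sketch of the arithmetic, assuming the standard SCSI descriptor-format sense layout (response code 0x72, with an Information descriptor carrying an 8-byte big-endian LBA), the snippet below pulls the sense bytes out of the log excerpt and decodes them. The helper names are invented for this example, and only the fields visible in this log are handled.

```python
import re

LOG_EXCERPT = """\
Mar 18 05:43:04 localhost kernel: Descriptor sense data with sense descriptors (in hex):
Mar 18 05:43:04 localhost kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Mar 18 05:43:04 localhost kernel:         09 c3 05 61
Mar 18 05:43:04 localhost kernel: end_request: I/O error, dev sdc, sector 163775841
"""

# A log line whose payload consists only of two-digit hex values.
HEX_LINE = re.compile(r"kernel:\s+((?:[0-9a-f]{2}\s*)+)$")
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def collect_sense_bytes(log_text):
    """Join the hex dump lines of the excerpt into one bytes object."""
    tokens = []
    for line in log_text.splitlines():
        match = HEX_LINE.search(line)
        if match:
            tokens.extend(match.group(1).split())
    return bytes.fromhex("".join(tokens))


def decode_descriptor_sense(data):
    """Decode sense key, ASC/ASCQ and the LBA from descriptor-format sense data."""
    if data[0] & 0x7F != 0x72:
        raise ValueError("not current descriptor-format sense data")
    result = {"sense_key": data[1] & 0x0F, "asc": data[2],
              "ascq": data[3], "lba": None}
    # Descriptors start at byte 8; byte 7 holds their total length.
    pos, end = 8, min(len(data), 8 + data[7])
    while pos + 2 <= end:
        desc_type, desc_len = data[pos], data[pos + 1]
        if desc_type == 0x00 and desc_len == 0x0A:
            # Information descriptor: bytes 4..11 hold the 64-bit LBA.
            result["lba"] = int.from_bytes(data[pos + 4:pos + 12], "big")
        pos += 2 + desc_len
    return result


if __name__ == "__main__":
    sense = decode_descriptor_sense(collect_sense_bytes(LOG_EXCERPT))
    device, sector = IO_ERROR.search(LOG_EXCERPT).groups()
    print("sense key 0x%x, ASC/ASCQ 0x%02x/0x%02x" %
          (sense["sense_key"], sense["asc"], sense["ascq"]))
    print("LBA from sense data:", sense["lba"])
    print("sector from end_request on %s: %s" % (device, sector))
```

The last four bytes, 09 c3 05 61, are 0x09c30561 = 163775841 in decimal, so the drive reported an unrecovered read error (sense key 3, ASC/ASCQ 0x11/0x04) at exactly the sector the kernel logged.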

I'm attempting recovery with the help of the pages below; if that doesn't work, I guess I'll buy a replacement disk...

Building software RAID on CentOS
Recovering from a sector error reported by smartd
How to format with ext3
Can anything be done about bad blocks on an HDD? - ubuntu badblocks mke2fs e2fsck

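[Update]
I ran the badblocks command against /dev/sdc to check for bad sectors, but it found none. (It is a 1 TB disk, and running it with the --full option took about a week.)
For now I rebuilt the RAID1 array and reformatted the /dev/md0 RAID device with ext3, and that brought it back.
Maybe it was some kind of logical problem, such as a failed software RAID write...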
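The exact commands are not recorded here, so the following is only a guess at what a typical sequence for these steps might look like: a read-only badblocks scan, recreating the two-disk RAID1 array with mdadm, and formatting it with ext3. The function `recovery_plan` is a name made up for this sketch, and the device names are taken from the mail and logs above. Recreating and reformatting the array destroys its contents, so the script only prints the commands and does not run them.

```python
import shlex


def recovery_plan(array, members, suspect_disk, fstype="ext3"):
    """Return the command lines for one possible rebuild sequence (not executed)."""
    return [
        # Read-only surface scan with progress (-s) and verbose output (-v).
        ["badblocks", "-s", "-v", suspect_disk],
        # Stop the degraded array before recreating it.
        ["mdadm", "--stop", array],
        # Recreate the mirror from scratch. DESTRUCTIVE: existing data is lost.
        ["mdadm", "--create", array, "--level=1",
         "--raid-devices=%d" % len(members)] + list(members),
        # Put a fresh filesystem on the new array. DESTRUCTIVE as well.
        ["mkfs.%s" % fstype, array],
        # Check the resync progress afterwards.
        ["cat", "/proc/mdstat"],
    ]


if __name__ == "__main__":
    plan = recovery_plan("/dev/md0", ["/dev/sdb1", "/dev/sdc1"], "/dev/sdc")
    for command in plan:
        # Dry run: print each command, properly quoted, instead of running it.
        print(" ".join(shlex.quote(arg) for arg in command))
```

If the data on the surviving member still matters, re-adding the failed partition to the existing array (for example with mdadm --manage and --add) instead of recreating it would avoid the reformat, though that is a different procedure from the one described in this update.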