The day my home server's RAID failed
I got a failure notification mail from the RAID setup I built on my home server a while back.
Subject: mdadm monitoring
This is an automatically generated mail message from mdadm
running on localhost.localdomain

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sdc1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md0 : active raid1 sdc1[2](F) sdb1[0]
      976759936 blocks [2/1] [U_]

unused devices:
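To read the mdstat snippet in that mail: `(F)` after a member marks it as faulty, and `[2/1] [U_]` means only one of the two expected members is active. Below is a minimal Python sketch that pulls this out of /proc/mdstat-style text. The function name `degraded_arrays` and the regular expressions are my own illustration, not part of mdadm, and they only cover the simple layout shown in the mail.

```python
import re

def degraded_arrays(mdstat_text):
    """Return {array: [faulty members]} for every degraded md array."""
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line.strip())
        if header:
            current = header.group(1)
            # Members look like "sdc1[2](F)"; the (F) suffix marks a faulty one.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            arrays[current] = {"failed": failed, "degraded": bool(failed)}
            continue
        # The status line carries "[expected/active]", e.g. "[2/1] [U_]".
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = map(int, counts.groups())
            if active < expected:
                arrays[current]["degraded"] = True
    return {name: a["failed"] for name, a in arrays.items() if a["degraded"]}

# Sample text copied from the notification mail above.
sample = """Personalities : [raid1]
md0 : active raid1 sdc1[2](F) sdb1[0]
      976759936 blocks [2/1] [U_]
unused devices:
"""
print(degraded_arrays(sample))
```

For the mail above this reports md0 as degraded with sdc1 as the faulty member, which matches what mdadm said.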
I checked /var/log/messages, and the errors that look relevant are around here.
Mar 18 05:43:04 localhost kernel: sdc: Current [descriptor]: sense key: Medium Error
Mar 18 05:43:04 localhost kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Mar 18 05:43:04 localhost kernel:
Mar 18 05:43:04 localhost kernel: Descriptor sense data with sense descriptors (in hex):
Mar 18 05:43:04 localhost kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Mar 18 05:43:04 localhost kernel: 09 c3 05 61
Mar 18 05:43:04 localhost kernel: end_request: I/O error, dev sdc, sector 163775841
Mar 18 05:43:04 localhost kernel: ata2: EH complete
Mar 18 05:43:04 localhost kernel: SCSI device sdc: 1953525168 512-byte hdwr sectors (1000205 MB)
Mar 18 05:43:11 localhost kernel: sdc: Write Protect is off
Mar 18 05:43:12 localhost kernel: SCSI device sdc: drive cache: write back
Mar 18 05:44:12 localhost kernel: ata2.01: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Mar 18 05:44:12 localhost kernel: ata2: SError: { UnrecovData Handshk }
Mar 18 05:44:12 localhost kernel: ata2.01: cmd 35/00:00:3f:05:c3/00:04:09:00:00/f0 tag 0 dma 524288 out
Mar 18 05:44:12 localhost kernel: res 40/00:cf:61:05:c3/40:03:09:00:00/f0 Emask 0x14 (ATA bus error)
Mar 18 05:44:12 localhost kernel: ata2.01: status: { DRDY }
Mar 18 05:44:23 localhost kernel: ata2: hard resetting link
Mar 18 05:44:24 localhost kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 18 05:44:24 localhost kernel: ata2.01: revalidation failed (errno=-2)
Mar 18 05:44:24 localhost kernel: ata2.01: disabled
Mar 18 05:44:25 localhost kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x40)
Mar 18 05:44:25 localhost kernel: ata2.00: revalidation failed (errno=-5)
Mar 18 05:44:25 localhost kernel: ata2: failed to recover some devices, retrying in 5 secs
Mar 18 05:44:30 localhost kernel: ata2: hard resetting link
Mar 18 05:44:30 localhost kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 18 05:44:30 localhost kernel: ata2.00: configured for UDMA/133
Mar 18 05:44:30 localhost kernel: ata2: EH complete
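As a cross-check on the log, the hex sense data can be decoded by hand. As far as I understand the SCSI descriptor sense format, byte 1 holds the sense key, bytes 2 and 3 hold the additional sense code and qualifier, and the last eight bytes of the information descriptor hold the failing block address. Here is a small Python sketch under that assumption, with the byte offsets hard-coded for this one dump rather than parsed generally:

```python
# The two hex rows from the kernel log, joined into one 20-byte buffer.
sense = bytes.fromhex(
    "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 09 c3 05 61"
)

sense_key = sense[1] & 0x0F        # 0x3 = Medium Error
asc, ascq = sense[2], sense[3]     # 0x11 / 0x04 = unrecovered read error, auto reallocate failed
# Information field of the descriptor: 8 bytes, big-endian.
lba = int.from_bytes(sense[12:20], "big")

print(sense_key, hex(asc), hex(ascq))
print(lba)         # 163775841, the same sector as the end_request line
print(lba * 512)   # byte offset on sdc, given the 512-byte sectors in the log
```

The decoded value matches the sector reported by `end_request: I/O error, dev sdc, sector 163775841`, so the medium error and the I/O error refer to the same spot on the disk.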
I'm working on recovery using the pages below as references, but if that doesn't work I suppose I'll buy a replacement disk...
Building software RAID on CentOS
Recovering from a sector error reported by smartd
How to format with ext3
Can anything be done about bad blocks on an HDD? – ubuntu badblocks mke2fs e2fsck
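[Update]
I ran the badblocks command against /dev/sdc to check for bad sectors, but nothing in particular was detected. (It is a 1 TB disk, and running it with the --full option took about a week.)
For the time being I rebuilt the RAID1 array and reformatted the RAID device /dev/md0 with ext3, and that brought it back.
Maybe it was a logical problem, something like a failed software RAID write...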
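For reference, the rebuild in the update could look roughly like the dry run below. This is a sketch and not a record of the exact commands I typed: the member devices /dev/sdb1 and /dev/sdc1 are taken from the mdadm mail, `rebuild_plan` is a made-up helper name, and the script only prints the commands instead of running them, because both of them destroy the existing data on the array.

```python
import shlex

def rebuild_plan(md="/dev/md0", members=("/dev/sdb1", "/dev/sdc1")):
    """Return the commands to recreate a RAID1 array and format it as ext3."""
    return [
        # Recreate the mirror from the listed partitions (destructive).
        ["mdadm", "--create", md, "--level=1",
         f"--raid-devices={len(members)}", *members],
        # Put a fresh ext3 filesystem on the md device (destructive).
        ["mkfs.ext3", md],
    ]

# Dry run: print each command, quoted for a shell, without executing it.
for cmd in rebuild_plan():
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing first makes it easy to double-check the device names before anything touches the disks.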