SVX Diary

2011-02-26(Sat) The RAID Degrades

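  I noticed that my software RAID had degraded. The SMART output was so awful that I'm pasting it here as a memento. You can't produce this kind of thing on demand, so it should be useful as a reference.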

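  Come to think of it, this HDD was one of the models with the firmware problem. Since I run it 24 hours a day, I estimated the chance of hitting that bug was low and never dealt with it. Now that the drive has simply died the ordinary way, it no longer matters.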

  For now, since I have no replacement HDD, I'll try forcing the drive back into the array.

smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
 
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST3500320AS
Serial Number:    9QM6THGV
Firmware Version: SD15
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Feb 26 22:11:11 2011 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
 
General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 650) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 120) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103b)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   084   006    Pre-fail  Always       -       9800766
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       65
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2045
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       8698539546
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       19213
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       65
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       204
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   039   045    Old_age   Always   In_the_past 39 (31 161 39 38)
194 Temperature_Celsius     0x0022   039   061   000    Old_age   Always       -       39 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a   040   021   000    Old_age   Always       -       9800766
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
 
SMART Error Log Version: 1
ATA Error Count: 235 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
 
Error 235 occurred at disk power-on lifetime: 19213 hours (800 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 e0  Device Fault; Error: ABRT
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a1 00 00 00 00 00 e0 00  20d+21:02:03.382  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 e0 00  20d+21:02:03.361  IDENTIFY DEVICE
  e5 00 55 9d 00 32 e0 00  20d+21:02:03.359  CHECK POWER MODE
  a1 00 00 00 00 00 e0 00  20d+21:02:01.283  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 e0 00  20d+21:02:01.261  IDENTIFY DEVICE
 
Error 234 occurred at disk power-on lifetime: 19213 hours (800 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 e0
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 e0 00  20d+21:02:03.361  IDENTIFY DEVICE
  e5 00 55 9d 00 32 e0 00  20d+21:02:03.359  CHECK POWER MODE
  a1 00 00 00 00 00 e0 00  20d+21:02:01.283  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 e0 00  20d+21:02:01.261  IDENTIFY DEVICE
  e5 00 55 01 00 00 e0 00  20d+21:02:01.259  CHECK POWER MODE
 
Error 233 occurred at disk power-on lifetime: 19213 hours (800 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 e0
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  e5 00 55 9d 00 32 e0 00  20d+21:02:03.359  CHECK POWER MODE
  a1 00 00 00 00 00 e0 00  20d+21:02:01.283  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 e0 00  20d+21:02:01.261  IDENTIFY DEVICE
  e5 00 55 01 00 00 e0 00  20d+21:02:01.259  CHECK POWER MODE
  00 00 00 00 00 00 00 ff  20d+21:01:56.167  NOP [Abort queued commands]
 
Error 232 occurred at disk power-on lifetime: 19213 hours (800 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 e0  Device Fault; Error: ABRT
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a1 00 00 00 00 00 e0 00  20d+21:02:01.283  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 e0 00  20d+21:02:01.261  IDENTIFY DEVICE
  e5 00 55 01 00 00 e0 00  20d+21:02:01.259  CHECK POWER MODE
  00 00 00 00 00 00 00 ff  20d+21:01:56.167  NOP [Abort queued commands]
  a1 00 00 00 00 00 e0 00  20d+21:01:24.676  IDENTIFY PACKET DEVICE
 
Error 231 occurred at disk power-on lifetime: 19213 hours (800 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 04 9d 00 32 e0
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 e0 00  20d+21:02:01.261  IDENTIFY DEVICE
  e5 00 55 01 00 00 e0 00  20d+21:02:01.259  CHECK POWER MODE
  00 00 00 00 00 00 00 ff  20d+21:01:56.167  NOP [Abort queued commands]
  a1 00 00 00 00 00 e0 00  20d+21:01:24.676  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 e0 00  20d+21:01:24.651  IDENTIFY DEVICE
 
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
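
  For anyone who wants to pick the alarming rows out of a dump like the one above automatically, here is a minimal Python sketch. The watch list (IDs 5, 187, 197, 198) and the names `WATCHED_IDS`, `parse_attributes`, and `suspicious` are my own assumptions for illustration, not anything smartctl defines. Note that the drive still reports "PASSED" overall, which is why looking at the raw values matters.

```python
import re

# Attribute IDs whose non-zero raw value usually deserves a closer look.
# This watch list is an assumption for illustration, not a vendor rule.
WATCHED_IDS = {5, 187, 197, 198}

# One row of the "Vendor Specific SMART Attributes" table as printed above.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)",
    re.IGNORECASE,
)

def parse_attributes(text):
    """Return one dict per attribute row found in smartctl-style text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "when_failed": m.group(8),
                "raw": int(m.group(9)),  # leading integer of RAW_VALUE only
            })
    return rows

def suspicious(rows):
    """Names of rows with a watched non-zero raw value or a WHEN_FAILED entry."""
    return [
        r["name"]
        for r in rows
        if (r["id"] in WATCHED_IDS and r["raw"] > 0) or r["when_failed"] != "-"
    ]

# A few rows copied from the failing drive's dump above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   084   084   006    Pre-fail  Always       -       9800766
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2045
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       204
190 Airflow_Temperature_Cel 0x0022   061   039   045    Old_age   Always   In_the_past 39 (31 161 39 38)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for name in suspicious(parse_attributes(SAMPLE)):
        print(name)
```

  With the sample rows, this flags the reallocated, uncorrectable, and pending sector counters plus the temperature attribute that tripped its threshold in the past. Raw_Read_Error_Rate is deliberately left off the watch list, since its raw value on Seagate drives is generally not a plain error count.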

  No good. Re-adding it to the array fails too.

# cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md8 : active raid1 sdb8[2](F) sda8[0]
      244189056 blocks [2/1] [U_]
      
md7 : active raid1 sdb7[2](F) sda7[0]
      145893248 blocks [2/1] [U_]
      
md0 : active raid1 sdb1[2](F) sda1[1]
      16383872 blocks [2/1] [_U]
      
unused devices: <none>
md: bind<sdb1>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:1, dev:sdb1
 disk 1, wo:0, o:1, dev:sda1
md: recovery of RAID array md0
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 16383872 blocks.
md: bind<sdb7>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda7
 disk 1, wo:1, o:1, dev:sdb7
md: delaying recovery of md7 until md0 has finished (they share one or more physical units)
md: bind<sdb8>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda8
 disk 1, wo:1, o:1, dev:sdb8
md: delaying recovery of md8 until md7 has finished (they share one or more physical units)
md: delaying recovery of md7 until md0 has finished (they share one or more physical units)
usb 5-2: reset low speed USB device using uhci_hcd and address 3
usb 5-2: reset low speed USB device using uhci_hcd and address 3
md: md0: recovery done.
md: delaying recovery of md8 until md7 has finished (they share one or more physical units)
md: recovery of RAID array md7
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 145893248 blocks.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sdb1
 disk 1, wo:0, o:1, dev:sda1
ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata4.00: BMDMA stat 0x25
ata4.00: cmd c8/00:10:d0:c9:96/00:00:00:00:00/e0 tag 0 dma 8192 in
         res 71/04:04:9d:00:32/00:00:00:00:00/e0 Emask 0x1 (device error)
ata4.00: status: { DRDY DF ERR }
ata4.00: error: { ABRT }
ata4.00: both IDENTIFYs aborted, assuming NODEV
ata4.00: revalidation failed (errno=-2)
ata4: soft resetting link
ata4.00: both IDENTIFYs aborted, assuming NODEV
ata4.00: revalidation failed (errno=-2)
usb 5-2: reset low speed USB device using uhci_hcd and address 3
ata4: soft resetting link
ata4.00: both IDENTIFYs aborted, assuming NODEV
ata4.00: revalidation failed (errno=-2)
ata4.00: disabled
ata4: EH complete
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 9882064
raid1: sdb1: rescheduling sector 9882032
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 9882080
raid1: sdb1: rescheduling sector 9882048
raid1: sdb1: rescheduling sector 9882056
raid1: sdb1: rescheduling sector 9882064
raid1: sdb1: rescheduling sector 9882072
raid1: sdb1: rescheduling sector 9882080
raid1: sdb1: rescheduling sector 9882088
raid1: sdb1: rescheduling sector 9882096
raid1: sdb1: rescheduling sector 9882104
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 200956448
raid1: Disk failure on sdb7, disabling device.
raid1: Operation continuing on 1 devices.
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 200957472
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 200958496
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 200959520
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 32767776
end_request: I/O error, dev sdb, sector 32767776
md: super_written gets error=-5, uptodate=0
raid1: Disk failure on sdb1, disabling device.
raid1: Operation continuing on 1 devices.
md: md7: recovery done.
md: recovery of RAID array md8
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 244189056 blocks.
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:0, dev:sdb1
 disk 1, wo:0, o:1, dev:sda1
RAID1 conf printout:
 --- wd:1 rd:2
 disk 1, wo:0, o:1, dev:sda1
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 488394784
raid1: Disk failure on sdb8, disabling device.
raid1: Operation continuing on 1 devices.
md: md8: recovery done.
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 488395808
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 488396832
sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 488397856
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda7
 disk 1, wo:1, o:0, dev:sdb7
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda7
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda8
 disk 1, wo:1, o:0, dev:sdb8
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda8
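
  The degraded state shows up in /proc/mdstat as "[2/1]" with an underscore in the "[U_]" map and an "(F)" marker on the failed member. A rough Python sketch for spotting that follows. The function name `degraded_arrays` is hypothetical, and the parsing assumes the two-line layout seen in the output above.

```python
import re

def degraded_arrays(mdstat_text):
    """Map array name -> (expected, active, failed members) for degraded arrays.

    Assumes each array is printed as a header line ("mdN : ...") followed by
    a status line containing "[expected/active] [U_...]".
    """
    result = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        head = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if head:
            current = head.group(1)
            # Members marked (F) have been kicked out as faulty.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", head.group(2))
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = (expected, active, failed)
            current = None
    return result

# Two of the arrays from the /proc/mdstat output above.
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md8 : active raid1 sdb8[2](F) sda8[0]
      244189056 blocks [2/1] [U_]

md0 : active raid1 sdb1[2](F) sda1[1]
      16383872 blocks [2/1] [_U]

unused devices: <none>
"""

if __name__ == "__main__":
    for name, (expected, active, bad) in degraded_arrays(SAMPLE).items():
        print(f"{name}: {active}/{expected} active, failed: {', '.join(bad)}")
```

  On a live machine, the text could be read from /proc/mdstat instead of the sample string. Run from cron, a check like this would have told me about the degradation sooner than noticing it by chance.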

  It's hopeless... time to order an HDD. Below is the SMART output from the healthy drive.

smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
 
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EADS-00L5B1
Serial Number:    WD-WCAU45330156
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Feb 26 22:10:41 2011 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (23400) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   169   166   021    Pre-fail  Always       -       6533
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       28
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18784
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       28
194 Temperature_Celsius     0x0022   118   094   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.