OVH Community, your new community space.

Random Daily Outages and RAID Failing


Neil
07-01-2015, 11:05
Is it an SYS server or a Kimsufi server? Also, who directed you to the forum? You can report issues here: http://www.soyoustart.com/en/contact...r-services.xml

Affix
06-01-2015, 19:39
Quote Originally Posted by Neil
Hi

Have you contacted the SYS Team requesting the disk to be replaced? If so do you have a ticket number?
Every time I talk to them about a hardware fault, I get directed here. Billing fault, I get directed here. Now I've just had a catastrophic RAID failure!

Thanks OVH! Excellent customer support again!!

Neil
05-01-2015, 10:22
Hi

Have you contacted the SYS Team requesting the disk to be replaced? If so do you have a ticket number?

happyman
05-01-2015, 09:46
Quote Originally Posted by Affix
I keep forgetting SYS don't replace a single drive until they all fail. Guess I will go somewhere better.
I don't believe this is the case. If one drive is faulty they will replace the faulty drive. The only issue I can see is how long it will take....

Affix
05-01-2015, 09:10
Quote Originally Posted by heise
I hope you reported this via their website or your manager. And they like to have smartctl data for ALL HDDs.
I keep forgetting SYS don't replace a single drive until they all fail. Guess I will go somewhere better.

heise
03-01-2015, 08:44
I hope you reported this via their website or your manager. And they like to have smartctl data for ALL HDDs.
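
For anyone gathering that data, here is a minimal sketch of collecting a `smartctl -a` report for every disk. It assumes the disks show up as /dev/sda, /dev/sdb and so on, that smartmontools is installed, and that it runs as root. The function names and the output file names are made up for illustration.

```python
import glob
import shutil
import subprocess


def smart_commands(devices):
    """Build one `smartctl -a` command line per device."""
    return [["smartctl", "-a", dev] for dev in devices]


def collect(devices):
    """Run each command and return {device: report text}."""
    reports = {}
    for cmd in smart_commands(devices):
        # No check=True: smartctl can exit non-zero on a failing disk
        # and still print a full report that is worth keeping.
        result = subprocess.run(cmd, capture_output=True, text=True)
        reports[cmd[-1]] = result.stdout
    return reports


if __name__ == "__main__":
    # Assumption: whole disks are named /dev/sda, /dev/sdb, ...
    disks = sorted(glob.glob("/dev/sd[a-z]"))
    if shutil.which("smartctl") is None:
        print("smartctl not found; install smartmontools first")
    else:
        for dev, report in collect(disks).items():
            # Hypothetical file naming: smart_sda.txt, smart_sdb.txt, ...
            with open("smart_" + dev.split("/")[-1] + ".txt", "w") as fh:
                fh.write(report)
```

The resulting text files can then be attached to the support ticket, one per drive.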

Affix
03-01-2015, 06:40
Hi,

Over the past few weeks I have been seeing micro-outages reported by New Relic, Pingdom and an on-site Nagios monitoring solution.

They happen every day between 2am and 7am BST.

As a long-standing customer, this infuriates me and I am tempted to look elsewhere. The graphs in the control panel look normal and show no sign of peak usage around these times. This is happening across all 4 IP addresses assigned to my server, including the main IP and 3 virtual machines.

My load averages are around 1.19 and the server is more than capable of handling that load.

According to dmesg:
e1000e: eth0 NIC Link is Down
virbr0: port 1(eth0) entering disabled state
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Something is pulling down the ethernet interface.
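
To put a number on how often the link drops, a rough sketch like the one below could count the Down events per interface. The regular expression is an assumption based only on the e1000e wording in the lines quoted above (other drivers phrase it differently), and the function names are made up for illustration.

```python
import re

# Matches the e1000e-style messages quoted above, e.g. "eth0 NIC Link is Down".
LINK_RE = re.compile(r"(?P<iface>\w+) NIC Link is (?P<state>Up|Down)")


def link_events(lines):
    """Return an (interface, state) tuple for each link-state message."""
    events = []
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            events.append((match.group("iface"), match.group("state")))
    return events


def count_flaps(events):
    """Count how many times each interface reported Down."""
    flaps = {}
    for iface, state in events:
        if state == "Down":
            flaps[iface] = flaps.get(iface, 0) + 1
    return flaps


# The three dmesg lines from this post.
SAMPLE = [
    "e1000e: eth0 NIC Link is Down",
    "virbr0: port 1(eth0) entering disabled state",
    "e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None",
]

if __name__ == "__main__":
    print(count_flaps(link_events(SAMPLE)))  # {'eth0': 1}
```

Feeding it the full dmesg output instead of SAMPLE would show whether the number of drops lines up with the outages the monitoring reports.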

I also notice that one of the disks in my RAID set is beginning to fail:
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/33
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:00:00:3b:ad/00:04:1b:00:00/e0 tag 0 dma 524288 in
         res 51/40:5e:a2:3e:ad/00:00:1b:00:00/0b Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/33
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:00:00:3b:ad/00:04:1b:00:00/e0 tag 0 dma 524288 in
         res 51/40:5e:a2:3e:ad/00:00:1b:00:00/0b Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/33
sd 0:0:0:0: [sda] Unhandled sense code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
        1b ad 3e a2 
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
sd 0:0:0:0: [sda] CDB: Read(10): 28 00 1b ad 3b 00 00 04 00 00
ata1: EH complete
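
The hex bytes in that log can be decoded by hand to find the failing sector. The sketch below does the arithmetic, assuming the standard Read(10) layout (bytes 2-5 are the start LBA, bytes 7-8 the transfer length in sectors) and that the last four bytes of the sense data are the failing LBA. The variable names are made up for illustration.

```python
def be_int(hex_bytes):
    """Interpret a list of hex byte strings as one big-endian integer."""
    return int("".join(hex_bytes), 16)


# Bytes copied from the kernel log above.
CDB = "28 00 1b ad 3b 00 00 04 00 00".split()   # Read(10) command block
SENSE_LBA = "1b ad 3e a2".split()               # tail of the sense data

start_lba = be_int(CDB[2:6])     # first sector of the read request
length = be_int(CDB[7:9])        # number of sectors requested
bad_lba = be_int(SENSE_LBA)      # sector the drive could not read

offset = bad_lba - start_lba     # sectors read successfully before the error
remaining = length - offset      # sectors from the bad one to the end of the request

print(start_lba, length, length * 512)   # 464337664 1024 524288
print(bad_lba, offset, remaining)        # 464338594 930 94
print(bad_lba & 0x0FFFFFFF)              # 195903138
```

The numbers agree with the rest of the log: 1024 sectors of 512 bytes is the "dma 524288" transfer, and the 94 remaining sectors is the 0x5e in the "res" line. The low 28 bits of the failing LBA give 195903138, the same value smartctl prints in its error log, presumably because that log only records a 28-bit address.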

smartctl output for sda (smartctl -a /dev/sda):
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-431.11.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS724020ALA640
Serial Number:    PN2134P6H371WP
LU WWN Device Id: 5 000cca 22dcf8f1f
Firmware Version: MF6OAA70
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Jan  3 07:49:18 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(   24) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 317) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   055   055   016    Pre-fail  Always       -       650453851
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       80
  3 Spin_Up_Time            0x0007   133   133   024    Pre-fail  Always       -       469 (Average 468)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       143
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   145   145   020    Pre-fail  Offline      -       24
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       8063
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       402
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       402
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 23/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       143
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1953 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1953 occurred at disk power-on lifetime: 7942 hours (330 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e a2 3e ad 0b  Error: UNC 94 sectors at LBA = 0x0bad3ea2 = 195903138

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 00 3b ad e0 00   1d+06:29:38.054  READ DMA EXT
  ef 10 02 00 00 00 a0 00   1d+06:29:38.054  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00   1d+06:29:38.054  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+06:29:38.053  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00   1d+06:29:38.053  SET FEATURES [Set transfer mode]

Error 1952 occurred at disk power-on lifetime: 7942 hours (330 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e a2 3e ad 0b  Error: UNC 94 sectors at LBA = 0x0bad3ea2 = 195903138

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 00 3b ad e0 00   1d+06:29:35.224  READ DMA EXT
  ef 10 02 00 00 00 a0 00   1d+06:29:35.224  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00   1d+06:29:35.224  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+06:29:35.223  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00   1d+06:29:35.222  SET FEATURES [Set transfer mode]

Error 1951 occurred at disk power-on lifetime: 7942 hours (330 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e a2 3e ad 0b  Error: UNC 94 sectors at LBA = 0x0bad3ea2 = 195903138

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 00 3b ad e0 00   1d+06:29:32.485  READ DMA EXT
  ef 10 02 00 00 00 a0 00   1d+06:29:32.485  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00   1d+06:29:32.485  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+06:29:32.483  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00   1d+06:29:32.483  SET FEATURES [Set transfer mode]

Error 1950 occurred at disk power-on lifetime: 7942 hours (330 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e a2 3e ad 0b  Error: UNC 94 sectors at LBA = 0x0bad3ea2 = 195903138

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 00 3b ad e0 00   1d+06:29:29.696  READ DMA EXT
  ef 10 02 00 00 00 a0 00   1d+06:29:29.696  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00   1d+06:29:29.696  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+06:29:29.695  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00   1d+06:29:29.695  SET FEATURES [Set transfer mode]

Error 1949 occurred at disk power-on lifetime: 7942 hours (330 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 5e a2 3e ad 0b  Error: UNC 94 sectors at LBA = 0x0bad3ea2 = 195903138

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 00 3b ad e0 00   1d+06:29:26.900  READ DMA EXT
  ef 10 02 00 00 00 a0 00   1d+06:29:26.899  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00   1d+06:29:26.899  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+06:29:26.898  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00   1d+06:29:26.898  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         8         -
# 2  Short offline       Completed without error       00%         4         -
# 3  Short offline       Completed without error       00%         4         -
# 4  Short offline       Completed without error       00%         0         -
# 5  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
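
The overall self-assessment above says PASSED because it compares the normalized VALUE column against THRESH, while the raw counters already show reallocated and pending sectors. Below is a minimal sketch of pulling those raw counters out of the attribute table. The list of watched attributes is my own choice, not an official one, and the function names are made up for illustration.

```python
import re

# Attribute row: ID, name, 0x flag, VALUE, WORST, THRESH, type, updated,
# when_failed, then the raw value (only the leading integer is kept).
ATTR_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Assumption: these are the counters worth showing to support.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_attributes(text):
    """Return {attribute name: raw value} for every attribute row found."""
    attrs = {}
    for line in text.splitlines():
        match = ATTR_RE.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(3))
    return attrs


def suspicious(attrs):
    """Keep only watched attributes whose raw value is non-zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}


# Rows copied from the smartctl output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       143
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 23/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       143
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)))
```

On the rows above this reports 143 reallocated sectors, 143 reallocation events and 13 pending sectors, which are the figures worth quoting in a ticket alongside the full report.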