Raid rebuild


Neil
05-11-2009, 11:53
Hi

No problem, it did take a bit longer because they wanted to be sure you had backed up your data.

derchris
04-11-2009, 20:01
Thanks Neil,

I can see the other disk has been replaced as well.

Neil
04-11-2009, 13:31
Hi Chris

It can take hours, but not weeks. Most likely it is this slow because the other drive has performance problems as well. I have asked a technician to look at your ticket.

derchris
03-11-2009, 19:25
Does anyone have any estimates for a Raid 1 rebuild?
Started one at 14:00, almost 5 hrs later, it is at 2%.
This would take weeks ...
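
For a rough sense of scale, here is a minimal Python sketch that extrapolates from the figures above (2% after roughly 5 hours). It assumes the rebuild rate stays constant, which may well not hold on a struggling disk, and the function name is only illustrative.

```python
def estimate_rebuild_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linearly extrapolate the total rebuild time from the progress so far."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_hours * 100.0 / percent_done


total = estimate_rebuild_hours(5, 2)  # 5 h elapsed at 2% complete
remaining = total - 5
print(f"total ~{total:.0f} h (~{total / 24:.1f} days), remaining ~{remaining:.0f} h")
# prints: total ~250 h (~10.4 days), remaining ~245 h
```

So at that pace it works out to about a week and a half rather than hours, which already points to something being wrong with one of the disks.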

derchris
03-11-2009, 14:04
The problem is that both disks were showing errors.
So I expected both to be replaced.

Andy
03-11-2009, 14:00
Erm, doesn't that defeat the point of the disk change?

derchris
03-11-2009, 13:59
One disk replaced.
But the one that was showing as DEGRADED is still in.

derchris
02-11-2009, 22:33
Meh, both disks are gone.
Currently opening a ticket to get both replaced, hopefully not with another bunch of Seagate disks.
Funny though, there is the following entry when creating an incident ticket:

"The firmware of my Seagate hard disk is faulty"

derchris
02-11-2009, 19:22
Have you checked the logs for FS errors?
SMART really should show something if there is something wrong with the HDD.
Some of the counters should increase at least.

gigabit
02-11-2009, 16:48
Has anyone ever managed to get a tech guy to replace hard drives when rescue-pro says everything is fine and so does SMART?

Been testing my "new" server (already had a properly faulty disk replaced), but now I've got files corrupting over time. I downloaded a Fedora DVD ISO, and its hash has changed since I downloaded it. Same with some RAR files, which now have CRC errors.
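
One way to document this kind of silent corruption for a support ticket is to record checksums once and compare them again later. The following is a minimal Python sketch using only the standard library. The function names (`sha256_of`, `snapshot`, `changed_files`) are made up for illustration, and it assumes the files are not legitimately modified between the two runs.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each path to its current checksum."""
    return {str(p): sha256_of(p) for p in paths}


def changed_files(old, new):
    """List paths present in both snapshots whose checksum differs."""
    return sorted(p for p in old if p in new and old[p] != new[p])


if __name__ == "__main__":
    # Print "digest  path" for every file given on the command line.
    for path, digest in snapshot(sys.argv[1:]).items():
        print(f"{digest}  {path}")
```

Saving the printed list right after a download and diffing it against a later run shows exactly which files changed on disk without being touched.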

Myatu
02-11-2009, 02:07
Yeah, I'm stuck with Seagates... Blech! Hence that RAID error I was having. It's a soft one, so after it marked the sector bad and allocated a spare, it was fine... But still, never had luck with those. I love Samsung though! Never had issues with them, plus they're quiet.

derchris
02-11-2009, 02:03
I use only Samsung drives. 6x1.5 TB in my home PC

Andy
02-11-2009, 01:49
I prefer Samsungs, especially the F1/F2 or F3s for the 1TB/1.5TB drives. Have 4x1TB F1s and a 1.5TB F2 myself on my home server.

Dave
02-11-2009, 01:48
I agree with Andy, Seagate is terrible, but my new server came with two of these so I'm happy:

WD Caviar Black 750 GB SATA Hard Drives ( WD7501AALS )

Funnily enough, they are the same drives I picked up for my home PC not so long ago, and they are much quicker than my "GREEN" Maxtor jobbies (yep, bought a green hard drive by accident a long time ago).

derchris
02-11-2009, 01:09
FTP Backup still running at the moment.
Once that is done I will check the Raid and the drives.
Need to get some answers from OVH, then I can upgrade the server with all my stuff.
Don't want to put that much effort in it at the moment, as I was going to upgrade my server anyway.
Just odd that I only found out about the problem after reading all the other threads about HDD errors, and then checking mine.
No problems have been reported so far.
Need to set some sort of notification/alarm for the next one.
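
As a starting point for such an alarm, here is a minimal Python sketch that scans the text printed by `tw_cli info c0` and reports every unit or port whose status is not OK. It assumes the column layout matches the listing in the first post of this thread; the function name is illustrative, and actually running `tw_cli` from cron and sending a mail is left out.

```python
import re

SAMPLE = """\
Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-1    DEGRADED       -      -       698.637   ON     -        -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     698.63 GB   1465149168    5QK0L8CC
p1     DEGRADED         u0     698.63 GB   1465149168    5QK0LHLN
"""


def non_ok_entries(tw_cli_output):
    """Return (name, status) for each unit/port row whose status is not OK."""
    problems = []
    for line in tw_cli_output.splitlines():
        fields = line.split()
        if not fields or not re.fullmatch(r"[up]\d+", fields[0]):
            continue  # header, separator or blank line
        # Unit rows have a type column (e.g. RAID-1) before the status.
        status_index = 2 if fields[0].startswith("u") else 1
        if len(fields) <= status_index:
            continue
        status = fields[status_index]
        if status != "OK":
            problems.append((fields[0], status))
    return problems


print(non_ok_entries(SAMPLE))
# prints: [('u0', 'DEGRADED'), ('p1', 'DEGRADED')]
```

Note that this also flags transient states such as REBUILDING, which is probably what you want for a notification anyway.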

Andy
02-11-2009, 01:02
Ouch, Seagate... There's your problem! I will never ever trust a Seagate again as long as I may live after the 5 failures I suffered this year from identical Seagate drives...

Get replacements and rebuild it, simple really. And make sure it's not a Seagate!

gigabit
01-11-2009, 22:58
Sounds a lot like the problems I've been having, with data corrupting randomly on the drives.

derchris
01-11-2009, 21:01
As predicted.
During a tar I got FS/Journal errors, and it is now mounted RO.
Currently doing an FTP tar on the fly, and will then boot into Rescue and check.
Hopefully all that is needed is an fsck.

derchris
01-11-2009, 20:03
After reading several posts about HDD errors, I checked my RAID 1 as well, and funnily enough it was showing as DEGRADED.
Here is the output of tw_cli info c0:

PHP Code:
Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-1    DEGRADED       -      -       698.637   ON     -        -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     698.63 GB   1465149168    5QK0L8CC
p1     DEGRADED         u0     698.63 GB   1465149168    5QK0LHLN 
So I did the following:

PHP Code:
# tw_cli maint remove c0 p1
# tw_cli maint rescan c0
# tw_cli maint createunit c0 rspare p1
# tw_cli maint rebuild c0 u0 p1 
It was showing as REBUILDING after that.
Checked again 1-2 hrs later, but it is back to DEGRADED.

I then checked the SMART status of the p1 disk:

PHP Code:
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST3750330AS
Serial Number:    5QK0LHLN
Firmware Version: SD15
User Capacity:    750,156,374,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Nov  1 18:53:55 2009 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   096   006    Pre-fail  Always       -       53611689
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       29
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2044
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       8850576272
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9701
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       29
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
188 Unknown_Attribute       0x0032   099   099   000    Old_age   Always       -       4295032833
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   035   045    Old_age   Always   In_the_past 41 (7 41 60 37)
194 Temperature_Celsius     0x0022   041   064   000    Old_age   Always       -       41 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   023   010   000    Old_age   Always       -       53611689
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 2

Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 7773 hours (323 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 18 ff ff ff ef 00      10:35:19.460  READ DMA EXT
  25 00 18 ff ff ff ef 00      10:35:16.210  READ DMA EXT
  25 00 08 ff ff ff ef 00      10:35:16.196  READ DMA EXT
  25 00 40 ff ff ff ef 00      10:35:16.195  READ DMA EXT
  25 00 40 ff ff ff ef 00      10:35:16.194  READ DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Which is not looking good.
Out of interest I did the same for the p0 disk:

PHP Code:
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST3750330AS
Serial Number:    5QK0L8CC
Firmware Version: SD15
User Capacity:    750,156,374,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Nov  1 18:54:46 2009 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   103   099   006    Pre-fail  Always       -       168335976
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       29
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       155398418799
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9701
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       29
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   076   076   000    Old_age   Always       -       24
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   038   045    Old_age   Always   In_the_past 38 (4 63 57 33)
194 Temperature_Celsius     0x0022   038   062   000    Old_age   Always       -       38 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   046   018   000    Old_age   Always       -       168335976
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)

Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 24 occurred at disk power-on lifetime: 9698 hours (404 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 a7 3f 03 00  Error: UNC at LBA = 0x00033fa7 = 212903

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 40 80 3f 03 e0 00  30d+22:50:33.024  READ DMA EXT
  25 00 40 80 3f 03 e0 00  30d+22:50:15.247  READ DMA EXT
  25 00 40 c0 ab 02 e0 00  30d+22:50:10.273  READ DMA EXT
  25 00 40 80 ab 02 e0 00  30d+22:50:10.263  READ DMA EXT
  25 00 40 40 ab 02 e0 00  30d+22:50:10.262  READ DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
So it is showing the same kind of errors, yet the RAID config still reports it as OK.
My guess is that it will fail sooner or later, so replacing one or both disks is in order.

Any other thoughts?
Was going to replace the server anyway at some point, so now looks like a good time. Just need to get the backups rolling before it dies completely.
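
Since both disks report PASSED overall while their raw counters tell a different story, it can help to look at a few attributes directly. Below is a minimal Python sketch that picks non-zero raw values out of a smartctl attribute table. It assumes the usual one-row-per-attribute layout, and the choice of attributes to watch (the `WATCHED` tuple) is my own assumption rather than any official rule; the sample rows are the p0 values from this post.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

P0_SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
187 Reported_Uncorrect      0x0032   076   076   000    Old_age   Always       -       24
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def suspicious_attributes(smart_table):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Columns: ID, name, flag, value, worst, thresh, type, updated,
        # when_failed, raw value (possibly followed by extra detail).
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw > 0:
                found[fields[1]] = raw
    return found


print(suspicious_attributes(P0_SAMPLE))
# prints: {'Reallocated_Sector_Ct': 3, 'Reported_Uncorrect': 24, 'Current_Pending_Sector': 1, 'Offline_Uncorrectable': 1}
```

Run against the p1 table, the same check would pick up the 2044 reallocated sectors, which fits with that disk being the one the controller marked DEGRADED.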