
Broken disk with + Realloc FAIL


slawek22
20-06-2010, 21:24
Okay, problem fixed. Thanks all for the help.

There's a nice tool: ddrescue (plus partprobe). I was able to copy everything from the disk with read errors to the working one.

Then I resized the partitions and split them in half, so it works for now (I left an unused partition over the area of the broken drive where the errors are). I'll be going for a disk replacement soon.
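For anyone landing here later, a minimal sketch of what such a ddrescue run might look like. The post does not give the exact commands, so the device names, the mapfile name and the choice of two passes are all assumptions; the function only prints the commands so nothing is touched until you have checked them.

```shell
#!/bin/sh
# Print (do not run) a two-pass ddrescue plan followed by partprobe.
# All three arguments are placeholders for this sketch, not values from the thread.
print_rescue_plan() {
    src=$1   # failing source, e.g. /dev/sda2 (assumed)
    dst=$2   # healthy target, e.g. /dev/sdb2 (assumed)
    map=$3   # ddrescue mapfile, lets an interrupted run resume
    # Pass 1: grab everything that reads cleanly, skip the slow phase (-n).
    echo "ddrescue -f -n $src $dst $map"
    # Pass 2: return to the bad areas and retry them up to three times.
    echo "ddrescue -f -r3 $src $dst $map"
    # Ask the kernel to re-read partition tables afterwards.
    echo "partprobe"
}
```

Calling print_rescue_plan /dev/sda2 /dev/sdb2 rescue.map lists the three commands. Double-check source and target before running them by hand: -f lets ddrescue overwrite a block device, and swapping the two arguments would destroy the good copy.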

slawek22
19-06-2010, 17:12
Sure, but it's a production server, so I want to do everything online with as few interruptions as possible.

So I have a kind of "DEADLOCK" situation here now. I can't fix the broken disk without dropping it from the array, and the broken disk is the only one working.

Maybe you've heard of some mdadm parameter that can add a disk back to the array while ignoring broken sectors? Some hardware controllers have a "restore broken drive" option (AFAIR) that does this...?

----

I always get mad when I hear something like "run the server in RESCUE, test the disk and send us the results" from OVH techs. Then I think: "so what's the F* RAID here for? I can drop the whole disk from the array anyway and do any tests they want with the server working".

Anyway, this is a little off-topic, but that policy needs to be changed.

Myatu
19-06-2010, 15:59
You can boot the machine in Rescue mode. You can (un)mount the drives individually, regardless of raid (though be careful with the partitions). The long test is online, yes - but the heavier the use, the longer it will take of course.

slawek22
19-06-2010, 15:34
Thanks Myatu for the suggestion, already found the link.

It appears that dd can't write to a drive that is part of an array (and the drive can't be removed from the array, as it's broken and the only working one for now).

Any other suggestions on how to write a block to a device that's in use (or copy a block from SDB to SDA, which should trigger re-allocation)? The problem is that dd just doesn't do anything (it locks up).
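A rough sketch of the two usual ways to overwrite a single sector by LBA, printed rather than executed. It assumes 512-byte logical sectors and identical layouts on both disks, neither of which is confirmed in the thread. Using oflag=direct to bypass the page cache is only a guess at why a plain dd might hang; it is not a verified fix.

```shell
#!/bin/sh
# Print (do not run) commands that overwrite one 512-byte sector by LBA.
# Copy the sector from the healthy disk onto the failing one.
print_sector_copy() {
    lba=$1    # whole-disk sector number, e.g. one reported by hdparm
    good=$2   # disk to read from (assumed healthy)
    bad=$3    # disk whose sector should be rewritten
    echo "dd if=$good of=$bad bs=512 count=1 skip=$lba seek=$lba conv=notrunc oflag=direct"
}

# Alternative: let hdparm write zeros to the sector (its contents are lost).
print_sector_zero() {
    echo "hdparm --write-sector $1 --yes-i-know-what-i-am-doing $2"
}
```

Either write should give the drive a chance to remap the sector. Keep in mind that a sector copied from a disk that is out of sync carries stale data, and a zeroed sector carries none, so run fsck afterwards.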

What about the smartctl long test? Is it online? (Can the server stay up, reading and writing to the disk, while it runs?)

Thx.

Myatu
19-06-2010, 13:57
Right, the one thing left really is to do a full check on SDA using "smartctl -t long /dev/sda", which takes a LONG time to complete - definitely a day+! This will check each sector and re-allocate it with a spare if possible/needed.

Hopefully this is all that's needed in your case (after which you do a fsck), but here's a good guide on how to force bad blocks to be reallocated (so they are no longer used by the fs or the raid): http://smartmontools.sourceforge.net/badblockhowto.html.

You're in for a long weekend!

_Lemon_
19-06-2010, 12:10
Time to back up the data that you do have access to and reinstall the entire thing, I guess.

slawek22
19-06-2010, 08:11
The mirror is BROKEN. So no, I have NO MIRROR.

MY RAID1 STATUS:
Update Time : Sat Jun 19 09:06:29 2010
State : active, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1

Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
2 8 18 1 spare rebuilding /dev/sdb2

REBUILD WILL NEVER FINISH

/SDA is broken (READ ERRORS)
/SDB is GOOD (BUT GOT DROPPED FROM THE ARRAY EARLIER)

SO THE DATA ISN'T IN SYNC ON THE GOOD DRIVE...
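A small sketch for checking this state from a script instead of by eye. Both helpers read the text of "mdadm --detail" on stdin and assume the field layout shown in the status above; the function names are made up for this example.

```shell
#!/bin/sh
# Succeed (exit 0) if the "State :" line of `mdadm --detail` mentions "degraded".
is_degraded() {
    awk -F' : ' '$1 ~ /^ *State$/ { found = ($2 ~ /degraded/) } END { exit !found }'
}

# Print the device path of every member listed as rebuilding.
rebuilding_members() {
    awk '/rebuilding/ { print $NF }'
}
```

Usage would be something like: mdadm --detail /dev/md2 | is_degraded && echo "array is degraded" (the name /dev/md2 is a guess, the thread never names the md device).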

Solutions ?

yonatan
18-06-2010, 19:05
RAID 1 is a mirror: there is only one version of the files on both disks, no "old" and "new" copy of the data.
Whatever you had on sda should be on sdb.

slawek22
18-06-2010, 18:04
RAID 1, but I cannot get the "good" drive back into the array without syncing from the bad ("read errors") drive it uses now.

"If it's 1, you simply do a rebuild"
I can't simply rebuild (add SDB). It gives read errors (from the broken SDA) and the array is still "degraded" when the process finishes.

Myatu
18-06-2010, 17:59
Raid 0 or 1? If it's 1, you simply do a rebuild. If it's 0, your only option is to back up whatever is still readable.

slawek22
18-06-2010, 17:33
Thanks for the answer.

The problem is that the only drive in the RAID that has recent data is the broken one (SDA).

SDB should be good, but it has outdated data (more than a month old). So I must figure something out to get all the recent data onto SDB before they replace SDA.

Will I need to copy the partition table to SDB too? (I heard that servers at OVH can't boot from SDB "automatically".)

yonatan
18-06-2010, 17:21
Submit a ticket quoting this log; they will replace your drive for sure.

There's no use trying to rebuild with a disk that has I/O errors.

slawek22
18-06-2010, 17:15
OK, long story short: my disks (SUPERPLAN) have been randomly dropping out of the RAID for the last 6 months.

For now I have 6 pending sectors on SDA plus several read errors.

Code:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       7
Code:
hdparm --read-sector 1460194261 /dev/sda
/dev/sda:
reading sector 1460194261: FAILED: Input/output error

hdparm --read-sector 1460194262 /dev/sda
[OK]

hdparm --read-sector 1460194263 /dev/sda
/dev/sda:
reading sector 1460194261: FAILED: Input/output error

This is SDA, but SDB has already been dropped from the array, and I can't rebuild it because of the read errors on SDA.

How can I copy the sectors from SDB to SDA (e.g. to trigger auto-reallocation)? Is that enough for OVH to replace the drives? (Or at least SDA; SDB looks OK for now, although it keeps dropping out of the array.)
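To keep an eye on the pending count without reading the whole SMART table, a tiny helper along these lines could work. It assumes the usual ten-column "smartctl -A" attribute layout, as in the line quoted above, and the function name is invented for this sketch.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the raw value (10th column)
# of attribute 197, Current_Pending_Sector.
pending_sectors() {
    awk '$1 == 197 { print $10 }'
}
```

Typical use: smartctl -A /dev/sda | pending_sectors. If the number drops after a sector has been rewritten, that suggests the drive has either recovered or remapped it.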