OVH Community, your new community space.

SSD failure on a RAID 1 setup


slawek22
25-06-2010, 23:50
/proc/mdstat only shows the RAID status, not the state of the physical disks, so it won't help much here.

kern.log will hold all the R/W errors; that should be enough.

Sometimes a disk can be dropped from the array due to an R/W timeout, however. Disks can reallocate bad/slow sectors, so a drive may appear good again after being dropped from the array (I don't know exactly how SSDs handle bad sectors).
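A quick way to pull those errors out of the log. The log line below is fabricated for illustration; real messages vary by kernel version and driver:

```shell
# Hypothetical kern.log line -- exact wording differs between kernels/drivers.
line='Jun 25 10:31:02 ns1 kernel: end_request: I/O error, dev sdb, sector 12345'

# Filter R/W errors out of the log; on a real server you would grep the
# file itself, e.g.: grep 'I/O error' /var/log/kern.log
printf '%s\n' "$line" | grep 'I/O error'
```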
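You can watch for that reallocation in the SMART attribute table. A sketch with a fabricated `smartctl -A` row (on a real disk you'd run `smartctl -A /dev/sdX`; the raw value in the last column is the number of sectors the drive has already remapped):

```shell
# Fabricated SMART attribute row, as seen in smartctl -A output:
attr=' 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 12'

# Extract the raw reallocated-sector count (last column):
printf '%s\n' "$attr" | awk '/Reallocated_Sector_Ct/ {print $NF}'   # prints 12
```

A raw value that rises between checks is a hint the disk is quietly degrading even while the array still looks clean.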

wminside
25-06-2010, 17:40
If you see that one of the disks has failed when checking /proc/mdstat, maybe the related /var/log/messages output would be enough info for OVH's techs?

slawek22
25-06-2010, 15:03
What for?

yonatan
25-06-2010, 14:53
Quote Originally Posted by slawek22
The disk will be dropped from the array and the work will continue as usual in all cases (for a single-disk read error / failure).

For multiple disk failures (when there's only one disk left in the RAID 1 array), you'll get a read error but the disk won't be dropped. So... in most cases the server will still work (as long as the error doesn't hit the system files area).

E.g. for MySQL (if you only have databases on the SSD, which is best) you'll get a "connection lost" error when reading the affected row, and the work will continue. Same for Apache in prefork - it'll fail to serve a single file (it'll be much worse if the error hits some script that is included sitewide).



You can't run fsck on a live filesystem or on an unmounted RAID mirror member. That would have no purpose anyway. You just use mdadm to add the drive back to the array.

You can run "badblocks" or "smartctl" (short/long test) on the failed drive to identify the errors (if any)... before adding it back to the array, so it won't affect the site's performance.

For OVH techs...
* I think the replacement time is awesome (for me, about 10-15 minutes). You should give the MODEL, SERIAL NUMBER, ID (/dev/sdX) and ERRORS (smart, kern, badblocks, etc.). Just run smartctl --all /dev/sdX and copy/paste.

* They of course won't replace the drive on their own (it's impossible to tell remotely whether a drive is damaged; it could be fine even if there were some read/write errors before). So you must file a ticket (not an ordinary one, but the other one, used to report damaged components).

* The bad thing is, I think they'll just "pull the plug". So you get damaged files and broken database tables on your "good" disk. Then fsck will run, and you'll be facing a very painful process of database check / recovery (if you have very big MyISAM tables - and you probably should be using MyISAM rather than InnoDB when storing data on SSD).

That's really strange, as the techs have SSH access and can log in and shut down the server. OVH has recently developed an SMS sending service, so even if they can't shut it down by hand they could just send an SMS saying "shutting down the server in 1 minute". So I think it's better to shut everything down yourself before the replacement.
http://tldp.org/HOWTO/Software-RAID-0.4x-HOWTO-4.html
Read this again.

slawek22
25-06-2010, 10:37
The disk will be dropped from the array and the work will continue as usual in all cases (for a single-disk read error / failure).

For multiple disk failures (when there's only one disk left in the RAID 1 array), you'll get a read error but the disk won't be dropped. So... in most cases the server will still work (as long as the error doesn't hit the system files area).

E.g. for MySQL (if you only have databases on the SSD, which is best) you'll get a "connection lost" error when reading the affected row, and the work will continue. Same for Apache in prefork - it'll fail to serve a single file (it'll be much worse if the error hits some script that is included sitewide).

Quote Originally Posted by yonatan
you will have to try to check the disk (fsck)
then you try to rebuild (mdadm)
if any of those fail (check the disk first), you send a ticket for a replacement; after the reboot it should rebuild in most cases.
You can't run fsck on a live filesystem or on an unmounted RAID mirror member. That would have no purpose anyway. You just use mdadm to add the drive back to the array.

You can run "badblocks" or "smartctl" (short/long test) on the failed drive to identify the errors (if any)... before adding it back to the array, so it won't affect the site's performance.

For OVH techs...
* I think the replacement time is awesome (for me, about 10-15 minutes). You should give the MODEL, SERIAL NUMBER, ID (/dev/sdX) and ERRORS (smart, kern, badblocks, etc.). Just run smartctl --all /dev/sdX and copy/paste.

* They of course won't replace the drive on their own (it's impossible to tell remotely whether a drive is damaged; it could be fine even if there were some read/write errors before). So you must file a ticket (not an ordinary one, but the other one, used to report damaged components).

* The bad thing is, I think they'll just "pull the plug". So you get damaged files and broken database tables on your "good" disk. Then fsck will run, and you'll be facing a very painful process of database check / recovery (if you have very big MyISAM tables - and you probably should be using MyISAM rather than InnoDB when storing data on SSD).

That's really strange, as the techs have SSH access and can log in and shut down the server. OVH has recently developed an SMS sending service, so even if they can't shut it down by hand they could just send an SMS saying "shutting down the server in 1 minute". So I think it's better to shut everything down yourself before the replacement.
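A sketch of that pre-check, written as a dry run that only prints the commands. The device name is an assumption; on a real server drop the `echo` wrappers and substitute your actual /dev/sdX:

```shell
# Dry-run sketch: the checks to run on a dropped member disk before
# re-adding it. DEV defaults to a hypothetical /dev/sdb.
DEV=${DEV:-/dev/sdb}
echo "smartctl -t short $DEV"     # queue a SMART short self-test
echo "smartctl -l selftest $DEV"  # read the self-test results afterwards
echo "badblocks -sv $DEV"         # read-only surface scan (can take hours)
```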
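To gather the model and serial for the ticket, something like this works. The `smartctl` output below is fabricated for illustration; on a real server feed it from `smartctl --all /dev/sdX`:

```shell
# Fabricated fragment of smartctl --all output (hypothetical model/serial):
out='Device Model:     EXAMPLE SSD 80GB
Serial Number:    ABC123456789'

# Pull out just the fields OVH asks for in a replacement ticket:
printf '%s\n' "$out" | awk -F': +' '/Device Model|Serial Number/ {print $2}'
```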

jonlewi5
25-06-2010, 07:17
As I had very limited experience with RAID myself, I wondered as well what would happen if a disk failed. I found this page, which gives info about simulating a drive failure:

https://raid.wiki.kernel.org/index.p...ng_and_testing

Obviously back up your data before trying anything, though.

I played about with it using standard platter drives though, so whether the drives being SSDs would make a difference I don't know; I can't see it being any different though.

yonatan
25-06-2010, 00:18
Quote Originally Posted by wminside
I'm really happy with my MG SSD server. I set it up to use RAID 1. What exactly would happen if one of the drives stopped working?

To be more specific, would the server stop working (pinging) and would OVH's team replace the damaged disk without me opening a ticket? If not, how can you prove that one of the drives is the cause of the server being down?

When the troubled drive gets replaced, what would I have to do? Maybe the RAID 1 array "rebuilds" itself on boot after replacing the drive?
The server in most cases will not stop working, and the RAID array will be degraded.

you will have to try to check the disk (fsck)
then you try to rebuild (mdadm)
if any of those fail (check the disk first), you send a ticket for a replacement; after the reboot it should rebuild in most cases.
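The rebuild step can be sketched as a dry run that just prints the commands; /dev/md0, /dev/sda and /dev/sdb1 are hypothetical names, so substitute your own before running anything for real:

```shell
# Dry-run sketch of re-adding a replaced disk to a RAID 1 array.
MD=${MD:-/dev/md0}       # the degraded array (assumption)
PART=${PART:-/dev/sdb1}  # the new disk's partition (assumption)
echo "sfdisk -d /dev/sda | sfdisk /dev/sdb"  # copy the partition table first
echo "mdadm --manage $MD --add $PART"        # start the resync
echo "cat /proc/mdstat"                      # follow rebuild progress
```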

You can always see your array status with:

# cat /proc/mdstat
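For illustration, a fabricated degraded-array excerpt and a check for it. In the bracketed status, `U` is a healthy member and `_` a missing/failed one:

```shell
# Fabricated /proc/mdstat excerpt for a degraded two-disk RAID 1:
mdstat='md0 : active raid1 sda1[0]
      524224 blocks [2/1] [U_]'

# An "_" in the status brackets means a member is missing or failed:
printf '%s\n' "$mdstat" | grep -q '_]' && echo "array degraded"   # prints: array degraded
```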

I am not sure about the failure rate of SSD drives, but like any disk... it happens sometimes.

Maybe the guys at OVH could build up a chart for us about faulty disk rates... (expected lifetime?).

wminside
25-06-2010, 00:12
I'm really happy with my MG SSD server. I set it up to use RAID 1. What exactly would happen if one of the drives stopped working?

To be more specific, would the server stop working (pinging) and would OVH's team replace the damaged disk without me opening a ticket? If not, how can you prove that one of the drives is the cause of the server being down?

When the troubled drive gets replaced, what would I have to do? Maybe the RAID 1 array "rebuilds" itself on boot after replacing the drive?