OVH Community, your new community space.

RPS Incident of 22nd January

rickyday

24-01-2009, 12:11

Originally Posted by fozle

I take it you don't agree with this. Mind you, it can be difficult to estimate the impact on different customers: 2 hours offline for one customer could be a lot worse for her than 12 hours offline for another. I think what's important right now is to make sure all those who were impacted are compensated in a timely fashion. That's just my opinion, though.

In my opinion, your opinion is right.

fozl

23-01-2009, 14:33

I take it you don't agree with this. Mind you, it can be difficult to estimate the impact on different customers: 2 hours offline for one customer could be a lot worse for her than 12 hours offline for another. I think what's important right now is to make sure all those who were impacted are compensated in a timely fashion. That's just my opinion, though.

unclebob

23-01-2009, 14:26

"To sum up, 18% of RPS broke down for 2h and 2% for 12h"

"All customers that were affected by this problem will get 1 month free"

fozl

23-01-2009, 14:23

Where does it say that?

unclebob

23-01-2009, 13:28

It's great to see the people that had a 2 hour break in service are getting the same compensation as those of us experiencing 20+ hours downtime...

oles@ovh.net

23-01-2009, 08:30

Hello,

Yesterday afternoon and until late last night, we experienced a major incident that affected 20% of our RPS customers.

The cause was a problem with the power supply to 8 SANs. A week ago, the electrical team worked in the SAN room on one of the 2 power feeds in order to add new SANs. This room currently holds more than 40 SANs in production and will eventually hold 120. For this work, they shut down one of the feeds, and when the work was finished they made a mistake reconnecting 8 SANs. Yesterday, during the generator tests, those 8 SANs suffered a power supply fault and went down.

The fault was corrected quickly, but a SAN takes a few hours to bring the service back up. The length of the outage came from a bug in Solaris, which delays bringing a SAN back online by between 2 and 12 hours, depending on the number of file systems to mount, with or without snapshots. We are working with Sun to improve SAN restart time, but for the moment the bug is still there. To sum up, 18% of RPS broke down for 2h and 2% for 12h (a SAN takes a lot of time to remount). We are also looking at how we can avoid this kind of human error in future.

All customers that were affected by this problem will get 1 month free. By Tuesday at the latest, an email will be sent to them with a form to fill out.

Sorry for the inconvenience caused.

Find out more:

http://travaux.ovh.net/?do=details&id=2798
EN version: http://translate.google.com/translat...hl=EN&ie=UTF-8

http://travaux.ovh.net/?do=details&id=2744
EN version: http://translate.google.com/translat...hl=EN&ie=UTF-8

Regards,

Octave