Hello,
From yesterday afternoon until late last night, we experienced a major incident that affected 20% of our RPS customers.
The cause was a problem with the electricity supply to 8 SANs. A week ago, the electrical team worked in the SAN room on one of the 2 electricity supplies in order to add the new SANs. This room currently holds more than 40 SANs in production and will eventually hold 120. For this work, the team shut down one of the supplies, but once the work was finished, a human error was made when reconnecting 8 SANs.
Yesterday, during the generator tests, those 8 incorrectly reconnected SANs lost power and went down. The fault itself was corrected quickly, but it takes a SAN several hours to bring the service back up. The length of the outage came from a bug in Solaris that delays the restart of a SAN by between 2 and 12 hours, depending on the number of file systems to mount, with or without snapshots. We are working with SUN to improve the restart time of a SAN, but for the moment the bug is still there.
To sum up, 18% of RPS were down for 2 hours and 2% for 12 hours (a SAN takes a long time to remount). We are also looking at how to avoid this kind of human error in the future.
All customers who were affected by this problem will get 1 month free. An email with a form to fill out will be sent to them by Tuesday at the latest.
Sorry for the inconvenience caused.
Find out more:
http://travaux.ovh.net/?do=details&id=2798
EN version:
http://translate.google.com/translat...hl=EN&ie=UTF-8
http://travaux.ovh.net/?do=details&id=2744
EN version:
http://translate.google.com/translat...hl=EN&ie=UTF-8
Regards,
Octave