OVH Community, your new community space.

Network Outage :(


Neil
26-07-2010, 11:11
Quote Originally Posted by ictdude
Does anybody know what the impact of changing this router will be?
Downtime? Which parts of the network will be affected? Monday evening ... Hope they do it after 01:00 AM or so. Then my customers are asleep ... pppfff sweat ...
Hi

Sorry for the downtime on Saturday. We worked as quickly as possible and everything was resolved as fast as it could be.

Monday evening's work will be done around midnight and there will be a cut on the Brussels link. All the information can be found here: http://status.ovh.co.uk/?do=details&id=285

ictdude
25-07-2010, 19:07
Quote Originally Posted by HugeServer
And read this page too: http://translate.google.com/translat...11&sl=fr&tl=en

There are more router changes: tonight (July 25)
and Monday, July 27 evening.
Does anybody know what the impact of changing this router will be?
Downtime? Which parts of the network will be affected? Monday evening ... Hope they do it after 01:00 AM or so. Then my customers are asleep ... pppfff sweat ...

Myatu
25-07-2010, 18:36
Quote Originally Posted by LawsHosting
We have to give them credit though, when a fault occurs, they do get onto the problem very quickly....... Unlike some companies I've dealt with in the past.
Very true. Another DC I know of ran into a similar issue a few months ago, with a cascading router failure... It took them half a day to get it working again. At least these routers were back up in a short timespan, albeit a little "iffy" for a bit (with another outage).

This is the only major downtime I know of at OVH; apparently the first according to Oles, so quite a feat in the lifetime of the company. I'm happy still.

Oh, PS, forgot to mention: OVH does Travaux updates on Twitter as well (have been for quite a while): @ovhteam

LawsHosting
25-07-2010, 17:54
We have to give them credit though, when a fault occurs, they do get onto the problem very quickly....... Unlike some companies I've dealt with in the past.

HugeServer
25-07-2010, 11:53
And another update: http://travaux.ovh.com/?do=details&id=4412

HugeServer
25-07-2010, 00:06
And read this page too: http://translate.google.com/translat...11&sl=fr&tl=en

There are more router changes: tonight (July 25)
and Monday, July 27 evening.

HugeServer
24-07-2010, 23:54
Translation of what Oles said in the mail:

From: oles@ovh.net
Sent: Sunday, July 25, 2010 12:39:25 AM
To: XXXXXXXXXX
--------------------------------------------

We have almost found it. We have two major maintenance jobs on the
network: tonight and Monday evening. Tonight's work is on the
Roubaix/London fibre, then on Monday Roubaix/Brussels. Because of that,
we sped up the end of the work on network redundancy through
London/Amsterdam and Frankfurt/Paris. That was set up on Thursday. So
tonight routing will be cut between Roubaix and London and will go via
Amsterdam. Then on Monday it will be cut towards Brussels and will go
via London to reach Amsterdam and via Paris to reach Frankfurt.

So today we were waiting for the maintenance (no OVH night ...), then
splash: the London router crashed and did not come back. A boot
problem; we had to take it over on the serial cable to finish booting
it. Right after that Frankfurt crashed because of memory, and that is
what made it click for us. Then, as if our happiness were not complete,
Amsterdam crashed too. Three routers in one hour. I have never
experienced that ... It took us a while to stabilise everything again,
and above all to bring Frankfurt back up. And Frankfurt also means
Zurich, Milan, Prague and Vienna. The party is complete ....

After analysis, we think that with the redundancy we put in place and
the extra data to synchronise, the routers ran out of unfragmented
RAM. But there is something unclear with MPLS. Without it, everything
works. With it, the router runs out of RAM. So we cut MPLS and brought
things back link by link. It is stable with 200 MB free out of 1 GB.
I have not understood everything.

One of the solutions, which we ordered 3 weeks ago, arrives in 5 weeks,
when we receive the ASR1000s for the route collectors. That will let us
reduce the number of BGP sessions per router and simplify the type of
BGP: no reflector. It will also take less CPU and less RAM. Especially
when the network is redundant, the same information arrives over
different paths at different times, and that makes for a lot of
computation. Basically, the current configuration has reached its
limit and has to evolve. It will be done.

Then we need to replace the Cisco 6509s with Nexus 7016s. We have
received one for the lab and are waiting until we can order 5 for the
backbone, because ... the cards we need do not exist yet. Only from
September ...

But right now people are complaining about something else: when routing
changes and the BGP tables have to be recalculated, the VSSes really
struggle. They sit at 100% CPU for a very long time and the ARP process
does not answer ARP requests from the network. As a result, customers'
servers expire the router's MAC and get no reply to their request. And
nothing pings. With the route collector we will reduce the CPU used for
BGP, but that problem will remain ... We will have to write ourselves a
proxy-ARP program that spoofs the router's MAC for all the other
requests.
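
For anyone wondering what the proxy-ARP idea at the end of the mail amounts to: a helper machine would answer ARP requests on the router's behalf, handing out the router's MAC while the router itself is too busy to reply. Below is a rough sketch of how such a reply frame is laid out (standard Ethernet + ARP fields). This is not OVH's software; the function names and the addresses are made up for illustration, and the code only builds the bytes, it does not send anything.

```python
import struct

def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))

def ip_to_bytes(ip: str) -> bytes:
    """Convert dotted-quad IPv4 text to 4 raw bytes."""
    return bytes(int(part) for part in ip.split("."))

def build_arp_reply(router_mac: str, router_ip: str,
                    asker_mac: str, asker_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP reply that says
    'router_ip is at router_mac', addressed to the asking server."""
    src = mac_to_bytes(router_mac)
    dst = mac_to_bytes(asker_mac)
    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP)
    eth = dst + src + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 2 (reply)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    # Sender (the router being impersonated), then target (the asker)
    arp += src + ip_to_bytes(router_ip) + dst + ip_to_bytes(asker_ip)
    return eth + arp

# Hypothetical addresses (documentation range), purely for illustration
frame = build_arp_reply("00:11:22:33:44:55", "192.0.2.1",
                        "66:77:88:99:aa:bb", "192.0.2.10")
print(len(frame), frame.hex())
```

A real proxy would also have to listen for the requests and decide which ones to answer, which is left out here.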

Speedy059
24-07-2010, 23:38
Servers going offline for some reason. Can they give us a warning of some sort? This is a horrible way to attract businesses, and a great way to scare them off.

turbanator
24-07-2010, 22:56
yeah this is such bs all the time network is going down

http://status.ovh.net/?do=details&id...0758b28cc4ce82

Tz-OVH
24-07-2010, 22:55
Aaaaaaand it's down again.

oles, have you heard of twitter? It's a good service.

Thelen
24-07-2010, 22:07
Yeah, it is very sad, especially after the 2x30 minute downtime a week ago. With ZERO indication of what the problem is or when it will be fixed, you are losing tonnes of business customers, who want to know these things...

Tz-OVH
24-07-2010, 21:58
And yet again, OVH/oles...host your forum/main site off-site...at least it'll be reachable at these times.

maksis
24-07-2010, 20:35
fra-5-6k is back. It is still struggling to bring up all its cards.
ams-1-6k is back; it looks like it has restarted a card again.
pob-1-6k: it was indeed a crash; it is being fixed over the serial cable, during boot.
vss-2-6k: proxy ARP re-enabled.

This is the worst backbone crash there has ever been at OVH ...
A domino effect on routers that had not been restarted for
some time and whose RAM had become fragmented.

Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622981: Pool: Processor Free: 30087848 Cause: Memory fragmentation
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622982: Alternate Pool: None Free: 0 Cause: No Alternate pool
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622983: -Process= "IP RIB Update", ipl= 0, pid= 164
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622984: -Traceback= 4102AD28 410433E0 41030958 42305768 406417AC 409D2680 413C2D10 42289224 40983230 40983350
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622985: Jul 24 19:21:07 GMT: %FIB-3-NORPXDRQELEMS: Exhausted XDR queuing elements while preparing message for slot/cpu 1/0
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622986: -Process= "IP RIB Update", ipl= 0, pid= 164
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622987: -Traceback= 413C2DE0 42289224 406417AC 42305768 40983230 40983350 409D2680
Jul 24 20:21:46 40g.fra-5-6k.routers.chtix.eu 623015: Jul 24 19:21:11 GMT: %FIB-3-NOMEM: Malloc Failure, disabling DCEF
Jul 24 20:27:34 40g.fra-5-6k.routers.chtix.eu 623147: Jul 24 19:27:15 GMT: %C6KFIB-4-DISABLED: Hardware FIB forwarding disabled, reverting to software-only forwarding.

It is time we put in the new generation of routers,
but the plan is for September (they have to be available first).
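
A note on the "Cause: Memory fragmentation" line in the log above: the pool still reports plenty of free bytes, yet an allocation fails, because no single contiguous free block is large enough. Here is a toy sketch of that effect. The pool size, block layout and function names are invented for illustration and have nothing to do with how IOS actually manages memory.

```python
def free_holes(pool_size, allocations):
    """Return the sizes of the free gaps between live allocations.

    allocations is a list of (start, size) tuples that do not overlap.
    """
    holes, cursor = [], 0
    for start, size in sorted(allocations):
        if start > cursor:
            holes.append(start - cursor)
        cursor = start + size
    if cursor < pool_size:
        holes.append(pool_size - cursor)
    return holes

def can_allocate(holes, request):
    """A request succeeds only if one contiguous hole is big enough."""
    return any(hole >= request for hole in holes)

POOL = 1000  # hypothetical pool of 1000 units
# Long-running system: a 10-unit block still alive every 20 units
live = [(start, 10) for start in range(0, POOL, 20)]

holes = free_holes(POOL, live)
total_free = sum(holes)   # half the pool is free in total...
largest = max(holes)      # ...but only in small pieces

print(total_free, largest, can_allocate(holes, 64))
```

Half of the toy pool is free, but a request for 64 contiguous units still fails, which is the same shape of problem the router reported.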

ictdude
24-07-2010, 20:25
Quote Originally Posted by maksis
That looks really bad... the whole thing is down, a core router is dead..

maybars
24-07-2010, 20:24
Quote Originally Posted by Rilly
Is it the grey connections that indicate no traffic?
Unfortunately yes. However, I think it is now fixed after 3 more interruptions.

Rilly
24-07-2010, 20:22
Quote Originally Posted by maksis
Is it the grey connections that indicate no traffic?

Rilly
24-07-2010, 20:21
Looks to be all back up (for me at least)

Busby
24-07-2010, 20:09
It must be pretty bad as I can't even access ovh.co.uk/ovh.com from the UK - I am posting this via another server in France - not on the OVH network..

Hope it's back soon, it's killing my radio station..

maksis
24-07-2010, 20:08
http://img8.imageshack.us/img8/2808/picture14iv.png nice

ictdude
24-07-2010, 20:04
Quote Originally Posted by RikT
Damn

Oles must have spilt his Champagne over the router
Yeah, I noticed. Some customers have already called me asking what's wrong with their websites. ppfff ...

Myatu
24-07-2010, 20:03
Quote Originally Posted by RikT
Damn

Oles must have spilt his Champagne over the router
... Just when he was celebrating neutralizing that last rat

RikT
24-07-2010, 20:00
Damn

Oles must have spilt his Champagne over the router

Rilly
24-07-2010, 19:43
Multiple servers unreachable from multiple locations
(Posting for those who come looking for any known issues)

http://travaux.ovh.net/?do=details&id=4408

English translation

Task Type Incident
Category entire network
Current State
Percentage completed

Details
The router is down.

Comment OVH - Saturday, July 24, 2010, 20:24
fra-5-1 th1 struggling. Not enough CPU.
MPLS has been deactivated on the whole backbone.


Comment OVH - Saturday, July 24, 2010, 20:28
Jul 24 20:28:13 40g.fra-5-6k.routers.chtix.eu 623150: Jul 24 19:27:53 GMT: %FIB-2-FIBDOWN: CEF has been disabled due to a low memory condition.
Jul 24 20:28:13 40g.fra-5-6k.routers.chtix.eu 623151: It can be re-enabled by configuring "ip cef [distributed]"
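
If you want to sift through router messages like the ones above, the useful part is the %FACILITY-SEVERITY-MNEMONIC tag, where the middle digit is the severity (0 is the most severe, 7 the least). A small sketch of pulling that tag out of a line follows; the regular expression and function name are my own, and it tolerates the stray space after "%" that the translation tool introduced.

```python
import re

# %FACILITY-SEVERITY-MNEMONIC: free text
CISCO_MSG = re.compile(
    r"%\s?(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+):"
    r"\s*(?P<text>.*)"
)

# Standard severity names, level 0 (worst) to 7
SEVERITY_NAMES = [
    "emergencies", "alerts", "critical", "errors",
    "warnings", "notifications", "informational", "debugging",
]

def parse_message(line):
    """Return facility, severity, mnemonic and text, or None if no tag is found."""
    match = CISCO_MSG.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["severity"] = int(fields["severity"])
    return fields

sample = ("Jul 24 7:27:53 p.m. GMT:% FIB-2-FIBDOWN: "
          "CEF has been disabled due to a low memory condition.")
parsed = parse_message(sample)
print(parsed["facility"], SEVERITY_NAMES[parsed["severity"]], parsed["mnemonic"])
```

Sorting by the severity digit is a quick way to see that FIBDOWN (2) is a more serious event than the C6KFIB DISABLED warning (4) that came just before it.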