Routing problem tonight


mike_
18-04-2012, 13:35
[Never mind, I misread it]

Neil
05-04-2012, 17:34
Hi

Just been told it is next week, so keep an eye out

RichardWnl
05-04-2012, 17:03
When is Octave going to post the link?

We are going to set up a URL so that you can trigger the SLA and send us the document to credit the 5% of time on your service. It will be posted in this task:
http://status.ovh.net/?do=details&id=2571

wii89
29-03-2012, 00:10
The above message, translated into English:

Hello,
We had a routing problem last night due to a software bug
affecting two core routers in Roubaix. These Cisco ASR 9010
routers aggregate the bandwidth of the Roubaix data centers
(RBX1 RBX2 RBX3 RBX4 RBX5) and provide the connection to
Paris, Brussels, Amsterdam, London and Frankfurt. In short,
the core routing in Roubaix.

This bug is known and is linked to the new cards we put
into production in late January (24x10G per slot). For
some random reason the card starts detecting ECC RAM
errors and stops routing packets. Worse still, the card
does not declare itself "down" and stays in the router as
if it were working. Other routers keep sending it packets,
but there is no one at the other end. Everything falls
into a black hole and the network no longer works
properly. The worst case: a failure that is not clear-cut.

Last night, three 24x10G cards in the two ASR 9010 routers
hit this bug almost simultaneously. This broke the network
into 3 pieces: United States / London / Amsterdam / Warsaw,
Roubaix, and Paris / Frankfurt / Madrid / Milan, while
drawing the packets into Roubaix. Normally the traffic
would have been rerouted, but here it was sucked in and
blocked in Roubaix.

As a result we could not operate or administer the
network, or retrieve logs from all the routers to find
the origin of the problem. We worked the old-fashioned
way, using the emergency / external connections to log in
to each backbone router and check whether it was the one
causing the problem. This took time, because on top of
that two routers were down, and it took a while to
understand that the problem came not just from router
rbx-g2-a9 but also from rbx-g1-a9. Once we restarted the
three cards, everything came back within 5 minutes.

About three weeks ago we opened a ticket with Cisco about
this ECC RAM problem. Cisco has worked on it and this
morning was able to provide us with the software patch to
apply to the routers to fix it. We will do this tonight.
No outage is expected.

We are also looking at how to improve the management of
our routers in the case where the whole backbone is down
for a reason that never normally occurs. We can handle
this case, but it is slow. Very slow.

In any case, the outage went beyond what 99.9% allows: it
lasted 1:22, whereas we are "entitled" to 43 minutes of
downtime per month. The penalties for exceeding the
permitted time are therefore triggered. Example: on OVH
dedicated servers (SD) it is 5% per hour of downtime.
We will set up a URL so that you can trigger the SLA and
we can send the document crediting the 5% of time on your
service. It will be posted in
the task http://status.ovh.co.uk/?do=details&id=2571

It is never pleasant to write this kind of email, but
when we are not good, well, we own it and apologise.

Sorry again.

Regards
Octave

mike_
28-03-2012, 20:40
Does this apply to Kimsufi too? kimsufi.co.uk says SLA of 99.9%. My server was down for about 90 minutes during the early hours of this morning.

oles@ovh.net
28-03-2012, 19:50
Hello,

We had a routing problem last night due to a software bug affecting two core routers in Roubaix. These Cisco ASR 9010 provide bandwidth for data centers in Roubaix (RBX1 RBX2 RBX3 RBX4 RBX5) and the connection to Paris, Brussels, Amsterdam, London and Frankfurt. In short, the core routing in Roubaix.

This bug is known and is linked to the new cards we put into production in late January (24x10G per slot). For some random reason the card starts detecting ECC RAM errors and stops routing packets. Worse still, the card does not report a "down" state and stays in the router as if it were working. Other routers keep sending it packets, but there is no one at the other end. Everything falls into a black hole and the network no longer works properly. The worst case: a failure that is not clear-cut.
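
To illustrate why this kind of failure is so awkward, here is a minimal Python sketch. It does not describe how the ASR 9010 or OVH's monitoring actually works: the `LineCard` fields, the `classify` function and the 50% threshold are all hypothetical. It only shows that a check relying on the state a card declares misses the problem, while a check based on packets actually forwarded would flag it.

```python
from dataclasses import dataclass


@dataclass
class LineCard:
    """Hypothetical view of a line card; these fields are illustrative only."""
    name: str
    reported_up: bool     # state the card declares to the router
    probes_sent: int      # test packets pushed through the card
    probes_returned: int  # test packets that made it out the other side


def classify(card: LineCard, min_success_ratio: float = 0.5) -> str:
    """Combine the declared state with what the data plane actually does."""
    if not card.reported_up:
        return "down"          # clean failure: neighbours can route around it
    if card.probes_sent == 0:
        return "unknown"       # nothing measured, so no verdict
    ratio = card.probes_returned / card.probes_sent
    # Declared "up" but forwarding almost nothing: the black-hole case.
    return "ok" if ratio >= min_success_ratio else "black-hole"


cards = [
    LineCard("card-a", reported_up=True, probes_sent=100, probes_returned=100),
    LineCard("card-b", reported_up=False, probes_sent=100, probes_returned=0),
    LineCard("card-c", reported_up=True, probes_sent=100, probes_returned=0),
]
for card in cards:
    print(card.name, classify(card))
```

In this sketch `card-b` is the easy case, because it says it is down. `card-c` is the case described above: it still claims to be up, so only the probe count gives it away.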

Last night, three 24x10G cards in the two ASR 9010 routers hit this bug almost simultaneously. This broke the network into 3 pieces: United States / London / Amsterdam / Warsaw, Roubaix, and Paris / Frankfurt / Madrid / Milan, while drawing the packets into Roubaix. Normally the traffic would have been rerouted, but here it was sucked in and blocked in Roubaix.

As a result we were not able to administer the network or retrieve logs from all the routers to find the origin of the problem. We worked the old-fashioned way, using the emergency / external connections to log in to each backbone router and check whether it was the one causing the problem. This took time, because on top of that two routers were down, and it took a while to understand that the problem came not just from router rbx-g2-a9 but also from rbx-g1-a9. Once we restarted the three cards, everything came back within 5 minutes.

About three weeks ago we opened a ticket with Cisco about this ECC RAM problem. Cisco has worked on it and this morning was able to provide us with the software patch to apply to the routers to fix it. We will do this tonight. No outage is expected.

We are also looking at how to improve the management of our routers in the case where the whole backbone is down for some reason that hopefully never comes. We can handle this case, but it is slow. Very slow.

In any case, the outage went beyond what 99.9% allows: it lasted 1:22, whereas we are "entitled" to 43 minutes of downtime per month. The penalties for exceeding the permitted time are therefore triggered. Example: on OVH Dedicated Servers it is 5% per hour of downtime.

We will set up a URL so that you can trigger the SLA and we can send the document crediting the 5% of time on your service. It will be posted in the task http://status.ovh.co.uk/?do=details&id=2571
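
For anyone wanting to check the figures, here is a small Python sketch of the arithmetic. Two things in it are assumptions and not something stated in the announcement: the month is taken as 30 days, and the 5% per hour is applied to every started hour of downtime. The function names are made up for the example.

```python
import math

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Downtime budget implied by an availability target over one month."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (100.0 - availability_pct) / 100.0


def credit_percent(downtime_minutes: float, rate_pct_per_hour: float = 5.0) -> float:
    """Credit if every started hour counts (an assumption, not OVH's stated rule)."""
    started_hours = math.ceil(downtime_minutes / 60)
    return started_hours * rate_pct_per_hour


outage = 1 * 60 + 22                     # the 1:22 quoted in the announcement
budget = allowed_downtime_minutes(99.9)  # roughly 43 minutes for a 30-day month
print(f"budget {budget:.1f} min, outage {outage} min, over by {outage - budget:.1f} min")
print(f"credit under the started-hour assumption: {credit_percent(outage):.0f}%")
```

With a 30-day month the 99.9% budget comes out at about 43 minutes, which matches the figure quoted, and 82 minutes of outage is well over it. If the credit is pro-rated per minute instead, the result would differ, so treat the last line as indicative only.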

It's never pleasant to write this kind of email, but we apologise.

Sorry again.

Regards
Octave
_________