the above message into english
Hello,
We had a routing problem last night caused by a
software bug affecting two core routers in
Roubaix. These Cisco ASR 9010s aggregate the
bandwidth of the Roubaix data centers (RBX1,
RBX2, RBX3, RBX4, RBX5) and provide the connections to
Paris, Brussels, Amsterdam, London and Frankfurt. In short,
they are the core of routing in Roubaix.
This bug is known, and it is linked to the new cards
that we put into production in late January (24x10G per
slot). At random, for some reason, a card detects
ECC errors in its RAM and stops routing packets. Worst
of all, the card nevertheless does not declare itself "down"
and stays in the router as if it were healthy.
The other routers keep sending it packets,
but there is no one at the other end. Everything falls into
a black hole and the network no longer works properly.
The worst case: a failure that is not clear-cut.
Last night, three 24x10G cards in two ASR 9010 routers
hit this bug almost simultaneously. This broke
the network into 3 pieces: United States / London / Amsterdam / Warsaw,
Roubaix, and Paris / Frankfurt / Madrid / Milan, while still
attracting packets to Roubaix. Normally the traffic would
have been rerouted, but here it was sucked in and blocked in Roubaix.
As a result we were unable to operate the network,
administer it, or retrieve logs from all the
routers to find the origin of the problem.
We worked the old-fashioned way, using
emergency / external connections to log in to each
backbone router and check whether it was the one
causing the problem. This operation
took time, because on top of that two routers were
down, and it took us a while to understand that
the problem came not just from one router, rbx-g2-a9, but
also from rbx-g1-a9. Once we restarted
the three cards, everything came back within 5 minutes.
About three weeks ago we had already opened a
ticket with Cisco about this ECC RAM
problem. Cisco has worked on the problem and was able
to provide us .. this morning with the software patch to apply
to the routers to fix this problem. We will
do this tonight. No outage is
expected.
We are also looking at how to improve the management of our
routers in the case where the whole backbone is down
for a reason that never happens. We can handle
this case, but it is slow. Very slow.
In any case, the outage lasted longer than the 99.9% allows,
i.e. 1:22 when we have the "right" to 43 min
of downtime per month. Penalties are therefore
triggered for exceeding the allowed time.
Example: on OVH SD it is 5% per hour of downtime.
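
As a rough check of the figures above, a 99.9% monthly availability
target leaves about 43 minutes of permitted downtime. The sketch below
assumes a 30-day month (the message does not say which month length is
used), and the function name is purely illustrative. It does not attempt
to compute the 5% credit, since the exact rule for that is not given here.

```python
def allowed_downtime_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime permitted per month for a given availability target.

    The 30-day default is an assumption made for illustration only.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Outage duration reported in the message: 1 hour 22 minutes.
outage_minutes = 1 * 60 + 22

allowed = allowed_downtime_minutes(99.9)
excess = outage_minutes - allowed

print(f"Allowed per month: {allowed:.1f} min")
print(f"Outage:            {outage_minutes} min")
print(f"Over the budget:   {excess:.1f} min")
```

Under that assumption the monthly budget is about 43.2 minutes, so an
82-minute outage exceeds it by roughly 38.8 minutes.
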
We will set up a URL so that you can
trigger the SLA and we can send you the document crediting
the 5% as time on your service. It will be posted in
the task
http://status.ovh.co.uk/?do=details&id=2571
It's never pleasant to write this kind of email, but
when we are not good, well, we own it and we
apologize.
Sorry again.
Regards
Octave