OVH Community, your new community space.

Routing problem during the night.


jonlewi5
29-03-2012, 10:10
Hello,
We had a routing problem last night due to a software bug affecting two core routers in Roubaix. These Cisco ASR 9010 routers aggregate the bandwidth of the Roubaix datacentres (RBX1, RBX2, RBX3, RBX4, RBX5) and provide the connections to Paris, Brussels, Amsterdam, London and Frankfurt. In short, they are the core of the routing in Roubaix.

This bug is known, and it is linked to the new cards we put into production in late January (24x10G per slot). For some reason a card will randomly detect ECC RAM errors and stop routing packets. Despite this, the card does not go into the "down" state and stays in the router as if it were healthy. Other routers keep sending it packets, but there is no one at the other end. Everything falls into a black hole and the network no longer works properly. Worst case: a failure that is not clear-cut.

That night, three 24x10G cards in 2 ASR 9010 routers hit this bug almost simultaneously. This broke the network into 3 pieces (United States / London / Amsterdam / Warsaw, Roubaix, and Paris / Frankfurt / Madrid / Milan) while still drawing packets into Roubaix. Normally the traffic would have been rerouted, but here it was sucked in and blocked in Roubaix.

As a result, we could not use the network to administer the network and retrieve the logs from all routers to find the origin of the problem. We had to work the old-fashioned way, using backup / external connections to log in to each backbone router and check whether it was the one causing the problem. This took time because two routers were affected, and it took us a while to understand that the problem came not only from rbx-g2-a9 but also from rbx-g1-a9. Once we restarted the 3 cards, everything came back within 5 minutes.

About three weeks ago we opened a ticket with Cisco about this ECC RAM problem. Cisco has worked on it and was able to provide us this morning with a software patch to apply on the routers to fix this problem. We will apply it tonight. No outage is expected.

We are also looking at how to improve the management of our routers when the entire backbone is down for a reason that never happens. We can handle this case, but it is slow. Very slow.

In any case, the outage lasted 1:22, which exceeds the 99.9% availability target: we are "entitled" to 43 minutes of downtime per month. So there are penalties triggered by exceeding the allowed time. Example: on SD at OVH it is 5% per hour of downtime. We will set up a URL so you can trigger the SLA and send us the request to credit 5% of the time on your service. It will be posted in the task http://travaux.ovh.com/?do=details&id=6533

It's never pleasant to write this kind of email, but when we are not good, well, we take responsibility and we apologize.

Sorry again.

Regards
Octave

oles@ovh.net
28-03-2012, 19:50
Hello,
We had a routing problem during the night (03-28-2012) due to a software bug affecting two core routers in Roubaix. These Cisco ASR 9010 routers aggregate the bandwidth of the Roubaix datacentres (RBX1, RBX2, RBX3, RBX4, RBX5) and provide the connections to Paris, Brussels, Amsterdam, London and Frankfurt. In short, they are the core of the routing in Roubaix.

This bug is known and is linked to the new cards we installed in late January (24x10G per slot). For some reason a card will detect ECC RAM errors and stop routing packets. However, the card does not report itself as "down" and remains in the router as if it were still operational. Other routers continue to send it packets, but nothing forwards them any further. The packets are silently dropped, which disrupts the network. Worst-case scenario: the network goes down.
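
To illustrate the failure mode described here (a card that still claims to be up while forwarding nothing), here is a minimal sketch. It is not how OVH or Cisco detect the fault: the `CardSample` fields, the `is_silent_blackhole` helper, the `min_in` threshold and the counter values are all hypothetical, chosen only to show why a reported status alone is not enough.

```python
from dataclasses import dataclass


@dataclass
class CardSample:
    """One polling interval for a line card (all fields are hypothetical)."""
    name: str
    reported_up: bool   # what the card claims about itself
    packets_in: int     # packets received during the interval
    packets_out: int    # packets forwarded during the interval


def is_silent_blackhole(sample: CardSample, min_in: int = 1000) -> bool:
    """Flag a card that claims to be up, receives traffic, and forwards none."""
    return (
        sample.reported_up
        and sample.packets_in >= min_in
        and sample.packets_out == 0
    )


# Made-up samples: a healthy card, a silently failing card, an honestly down card.
samples = [
    CardSample("card-0", reported_up=True, packets_in=50_000, packets_out=49_800),
    CardSample("card-1", reported_up=True, packets_in=48_000, packets_out=0),
    CardSample("card-2", reported_up=False, packets_in=0, packets_out=0),
]

suspects = [s.name for s in samples if is_silent_blackhole(s)]
print(suspects)
```

Only the card that reports "up" yet forwards nothing is flagged; the card that is honestly down is not, because neighbours would already have routed around it.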

That night, three 24x10G cards in 2 ASR 9010 routers hit this bug almost simultaneously. This broke the network into 3 pieces (United States / London / Amsterdam / Warsaw, Roubaix, and Paris / Frankfurt / Madrid / Milan), with packets being dropped in Roubaix. Normally the traffic would have been rerouted, but here it was still drawn into Roubaix and blocked there.
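
The difference between a failure that is announced and one that is not can be sketched with a toy routing model. The topology below is hypothetical and far simpler than the real backbone, and `shortest_path` and `deliver` are illustrative helpers, not anything taken from the routers. The point is only that traffic keeps following the shortest path through a node that still looks healthy, and is rerouted as soon as that node is withdrawn from routing.

```python
from collections import deque

# Hypothetical topology, for illustration only (not OVH's real backbone).
LINKS = {
    "london": ["roubaix", "amsterdam"],
    "amsterdam": ["london", "frankfurt"],
    "frankfurt": ["amsterdam", "paris"],
    "paris": ["frankfurt", "roubaix"],
    "roubaix": ["london", "paris"],
}


def shortest_path(src, dst, withdrawn=frozenset()):
    """Breadth-first search that ignores nodes withdrawn from routing."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in LINKS[path[-1]]:
            if nxt not in seen and nxt not in withdrawn:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def deliver(src, dst, dropping=frozenset(), withdrawn=frozenset()):
    """Return the path taken, or None if a transit node silently drops packets."""
    path = shortest_path(src, dst, withdrawn)
    if path is None or any(hop in dropping for hop in path[1:-1]):
        return None
    return path


# Roubaix drops packets but is still advertised: traffic is black-holed.
silent = deliver("london", "paris", dropping={"roubaix"})
# Roubaix is withdrawn from routing: traffic takes the longer path instead.
rerouted = deliver("london", "paris", dropping={"roubaix"}, withdrawn={"roubaix"})
print(silent)
print(rerouted)
```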

As a result, we were not able to use the network itself to reach the routers and retrieve their logs in order to establish the cause of the problem. We fell back on emergency external connections to log in to each backbone router and check which one was causing the problem. This took some time, because two routers were affected and it took a while to realise that the problem came not only from rbx-g2-a9 but also from rbx-g1-a9. Once the 3 cards were restarted, everything was operational again within 5 minutes.

About 3 weeks ago, we opened a ticket with Cisco regarding this ECC RAM problem. Cisco have been working on it and have provided a software patch to fix it. We will apply it tonight. We are also looking at how we can improve the management of our routers in the case where the whole backbone is down for some reason. We can handle this case today, but it is very, very slow.

So, the outage lasted 1:22, which exceeds what the 99.9% availability target permits: we are allowed "a maximum" of 43 mins of downtime per month. Therefore, there are penalties for exceeding the maximum time. For example, on SD at OVH the penalty is 5% per hour of downtime. We have created a URL so you can apply for credit under your SLA.
See the following link: http://travaux.ovh.com/?do=details&id=6533
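
For readers who want to check the figures, here is a small worked calculation. It assumes a 30-day month, which is where the 43 minutes comes from (0.1% of 43,200 minutes is 43.2 minutes). The credit calculation is only a guess at how "5% per hour" might be applied: it assumes a pro-rata credit on the whole outage, which the post does not confirm, and the function names are illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def allowed_downtime_minutes(availability: float) -> float:
    """Monthly downtime budget for an availability target such as 0.999."""
    return MINUTES_PER_MONTH * (1 - availability)


def credit_percent(outage_minutes: float, rate_per_hour: float = 5.0) -> float:
    """Credit as a percentage, ASSUMING a pro-rata rate over the whole outage."""
    return outage_minutes / 60 * rate_per_hour


outage_minutes = 1 * 60 + 22                # the 1:22 outage from the post
budget = allowed_downtime_minutes(0.999)    # about 43 minutes
excess = outage_minutes - budget            # time beyond the monthly budget
credit = credit_percent(outage_minutes)     # hypothetical pro-rata credit

print(f"budget: {budget:.1f} min")
print(f"excess: {excess:.1f} min")
print(f"credit: {credit:.2f} %")
```

The actual credit depends on the SLA terms of each service, so the last figure should be read as an illustration of the arithmetic only.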

It's never pleasant to write this kind of email but when our service is not up to standard, we must apologise.

Sorry again.

Regards
Octave