jonlewi5
29-03-2012, 10:10
Hello,
We had a routing problem that night due to a software bug affecting two core routers in Roubaix. These Cisco ASR 9010 routers aggregate the bandwidth of the Roubaix datacentres (RBX1, RBX2, RBX3, RBX4, RBX5) and handle the connections to Paris, Brussels, Amsterdam, London and Frankfurt. In short, the core routing in Roubaix.

This bug is known, and it is linked to the new line cards (24x10G per slot) that we put into production in late January. For some reason a card randomly detects ECC RAM errors and stops routing packets. Despite this, the card does not go into the "down" state and stays in the router as if it were healthy. The other routers keep sending it packets, but nobody is there to receive them. Everything falls into a black hole and the network stops working properly. Worst case: a failure that is not clean.

That night, three cards in two 24x10G ASR 9010 routers hit this bug almost simultaneously. This broke the network into three pieces: United States / London / Amsterdam / Warsaw, Roubaix, and Paris / Frankfurt / Madrid / Milan, with the packets being drawn into Roubaix. Normally the traffic would have been rerouted, but here it was sucked in and blocked in Roubaix. So we were not able to use the network to administer the network and retrieve the logs from all the routers in order to find the origin of the problem. We went back to the old way, with out-of-band connections to each backbone router to check whether that router was the one causing the problem. This operation took time, all the more so because two routers were down and it took us a while to understand that the problem came not only from the router rbx-g2-a9 but also from rbx-g1-a9. Once we restarted the three cards, everything came back within 5 minutes.

That was about three weeks ago. We had already opened a ticket with Cisco about this ECC RAM problem. Cisco worked on it and was able to provide us this morning with the software patch to apply on the routers to fix this bug. We will do this tonight. No outage is expected.

We are also looking at how to improve the management of our routers for the case where the entire backbone is down for a reason that is "never supposed to happen". We can handle this case, but it is slow. Very slow.

In any case, the outage lasted 1:22, more than the 99.9% SLA allows: we have the "right" to 43 minutes of downtime per month.
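The downtime arithmetic works out as in this small sketch (an illustration only; how partial hours are rounded for the 5%-per-hour credit mentioned below is an assumption, the post does not specify it):

```python
import math

# 99.9% monthly availability: allowed downtime in a 30-day month
total_minutes = 30 * 24 * 60            # 43200 minutes in a month
allowed = total_minutes * (1 - 0.999)   # ~= 43.2 minutes

# The outage lasted 1h22
outage = 1 * 60 + 22                    # 82 minutes

# SLA credit at 5% per hour of downtime; counting each started
# hour is an assumption -- the post does not say how partial
# hours are handled.
credit_percent = math.ceil(outage / 60) * 5

print(f"allowed: {allowed:.1f} min, outage: {outage} min, credit: {credit_percent}%")
```

So at 82 minutes the outage is nearly double the monthly allowance, which is why the penalty clause applies.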
So penalties apply for exceeding the allowed downtime. Example: for SD at OVH it is 5% per hour of downtime. We will provide a URL so you can trigger the SLA and send us the request, and we will credit the 5% per hour of downtime to your service. It will be posted in the task http://travaux.ovh.com/?do=details&id=6533

It's never pleasant to write this kind of email, but when we are not good, well, we own it and we apologize. Sorry again.

Regards
Octave