q10k8 BGP Flap Saturday 1st June 2024 22:28:20


While we were migrating one of our upstream providers, q10k8 accidentally leaked too many routes to another upstream, which disrupted the BGP sessions.

We restored connectivity via ZET.net first and, a few minutes later, via aurologic as well.

Post mortem about this incident and why it took so long to restore connectivity:

A little bit of backstory:

On May 31st, we started the migration to q10k8 (the new edge router at Equinix). Everything went quite well, and we were able to migrate almost all prefixes that we advertise to the DFZ without any interruption. We had a few DNS issues along the way because of the old QFX5100 stack, but these were resolved quite quickly.

At around 3:30 PM CET on June 1st, the migration of all prefixes from ear.fr7 to q10k8 was complete. We just had to make some minor config changes and perform some equipment maintenance onsite.

This also involved migrating one of our transit providers (ZET.net) to 100G. At 09:23 PM on June 2nd, we had everything ready to re-advertise all prefixes to ZET and committed the configuration. However, our export policy towards another upstream provider contained an error, so we redistributed too many prefixes to that provider, and the BGP neighbour on the other side went into the "Idle" state (MaxPath).

The error was introduced by two rollbacks on the router, which left the export policy without a "from community" statement.
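
To illustrate why a missing "from community" statement has such a large effect, here is a minimal Python sketch. It is a simplified model, not our actual router configuration: the community value, the prefixes, the route counts and the peer's prefix limit are all made up, and it assumes that the upstream tears the session down once it receives more prefixes than it allows. The point it shows is that a policy term without a match condition applies to every route in the table.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    communities: frozenset = field(default_factory=frozenset)


def export_policy(routes, match_community: Optional[str]):
    """Return the routes a single 'accept' term would export.

    A term without a match condition matches every route, which is the
    failure mode described above (the community match went missing).
    """
    if match_community is None:
        return list(routes)
    return [r for r in routes if match_community in r.communities]


def session_state(advertised_count: int, peer_max_prefixes: int) -> str:
    """Model the upstream side: too many prefixes takes the session down."""
    return "Idle" if advertised_count > peer_max_prefixes else "Established"


# Hypothetical table: 3 of our own prefixes (tagged) plus 1000 learned routes.
OWN = "64500:100"  # made-up community value (documentation ASN)
table = [Route(f"203.0.113.{i * 64}/26", frozenset({OWN})) for i in range(3)]
table += [Route(f"10.{i // 256}.{i % 256}.0/24") for i in range(1000)]

PEER_LIMIT = 50  # made-up prefix limit on the upstream side

intended = export_policy(table, match_community=OWN)  # with the match
broken = export_policy(table, match_community=None)   # match is missing

print(len(intended), session_state(len(intended), PEER_LIMIT))
print(len(broken), session_state(len(broken), PEER_LIMIT))
```

With the community match in place, only the three tagged prefixes are exported and the modelled session stays up. Without it, the whole table is exported and the modelled peer goes to "Idle".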

At this time, we were not advertising any routes to ZET, so we had to:

  1. Gain access to q10k8 first
  2. Restore the connectivity via ZET

However, we suspected that the active routing engine had run out of memory (OOM) at that time, so we rebooted it.

The reboot took up most of the 33 minutes it took to restore connectivity. As soon as we realized that the problem was with that one particular session, we restored connectivity via ZET and shortly afterwards via aurologic.

As soon as everything was back up, we fixed the configuration issue.

Will this incident happen again? Very unlikely. An event like this requires several factors to coincide, and unfortunately, that was the case here.