At 10:22 UTC, three racks in our AMS2 region lost network connectivity. As a result, every Droplet in those three racks lost all connectivity to and from the Internet. The outage occurred while we were replacing hardware with new equipment.
The infrastructure team was replacing optical components on the top-of-rack (ToR) switches that connect the three racks to the core network. Each of these switches has dual links into the core, and the swap involved bringing down one link at a time while its hardware was replaced. We had successfully completed the same procedure on numerous racks in AMS2 earlier in the day with zero impact to customers.
The software running on these three ToR switches was a non-standard version containing a bug that caused the Link Aggregation Control Protocol (LACP) to fail. As a result, everything connected to each switch lost connectivity until we were able to reconfigure the three switches.
Timeline of Events
10:20 UTC: One optical module is pulled from each of three separate switches
10:22 UTC: Alerts received for three downed switches
10:25 UTC: Networking team responds and begins investigation
10:48 UTC: Configuration applied to reset LACP on the affected switches
10:50 UTC: Alerts clear and connectivity is restored
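The intervals implied by this timeline can be worked out with a few lines of Python. This is only an illustrative sketch: the timestamps are copied from the entries above, while the event labels and the helper name minutes_between are our own shorthand, and the impact window is assumed to run from the first alert to the alerts clearing.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, same day).
# The short labels are shorthand chosen for this sketch.
TIMELINE = {
    "module_pulled": "10:20",
    "alerts_received": "10:22",
    "investigation_begins": "10:25",
    "lacp_reset_applied": "10:48",
    "connectivity_restored": "10:50",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Time from the first alert to the networking team responding.
time_to_respond = minutes_between(
    TIMELINE["alerts_received"], TIMELINE["investigation_begins"]
)

# Assumed impact window: first alert until alerts clear.
impact_minutes = minutes_between(
    TIMELINE["alerts_received"], TIMELINE["connectivity_restored"]
)

# Time from the fix being applied to alerts clearing.
time_to_clear = minutes_between(
    TIMELINE["lacp_reset_applied"], TIMELINE["connectivity_restored"]
)

print(f"Response time: {time_to_respond} min")
print(f"Impact window: {impact_minutes} min")
print(f"Fix to recovery: {time_to_clear} min")
```

Measured this way, the team responded 3 minutes after the first alert, and connectivity was restored 28 minutes after the first alert.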
We are already aware of other devices in our network that are affected by this LACP bug. In the coming weeks we will upgrade the software on those devices to prevent this issue from recurring. We will send out a maintenance notice in advance of the update.
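As a rough illustration of how devices on a non-standard software version might be flagged ahead of such an upgrade, consider the sketch below. Everything in it is hypothetical: the Switch record, the switch names, the version strings, and the find_nonstandard helper are invented for the example and do not describe our actual inventory or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str              # hypothetical device name
    software_version: str  # hypothetical version string


# Hypothetical "standard" version that every switch is expected to run.
STANDARD_VERSION = "4.2.1"

# Hypothetical inventory; a real one would come from an inventory system.
INVENTORY = [
    Switch("tor-a01", "4.2.1"),
    Switch("tor-a02", "4.1.9"),
    Switch("tor-a03", "4.2.1"),
    Switch("tor-a04", "4.1.9"),
]


def find_nonstandard(inventory: list[Switch], standard: str) -> list[Switch]:
    """Return the switches whose software version differs from the standard."""
    return [sw for sw in inventory if sw.software_version != standard]


needs_upgrade = find_nonstandard(INVENTORY, STANDARD_VERSION)
for sw in needs_upgrade:
    print(f"{sw.name}: running {sw.software_version}, expected {STANDARD_VERSION}")
```

A list like needs_upgrade could then be used to schedule upgrades and to decide which customers should receive the maintenance notice.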
We want to apologize for the experience you had today. We take any form of negative impact on your service seriously. Our goal is to constantly improve and we hope that our transparency and the details that we have shared help show you that.