On 2017-12-05 at 22:07 UTC, the Network Engineering team was performing routine maintenance to upgrade the software on switching hardware in our TOR1 region, moving off a release with known bugs. Part of our upgrade process is to first isolate the switch being upgraded, to prevent issues from arising during the upgrade. This usually works well; however, during this incident, isolating the switch triggered a bug that caused neighboring switches to drop their connections to downstream devices. As a result, a subset of Droplets had no external connectivity for approximately 12 minutes. The network recovered after the switch was rebooted.
22:07 UTC - Switches are isolated in preparation for upgrade
22:18 UTC - Network issues are detected in TOR1
22:19 UTC - Reboot initiated and connectivity is restored to impacted switches
While this is routine work for our network engineering team, we normally perform it during scheduled maintenance windows that have been communicated to customers. We did not follow that best practice in this case, and we are conducting an internal review of our processes to ensure that all necessary steps are taken and communicated before we undertake this type of work in the future. As for the outage itself, the root cause has been resolved by the software upgrade that is now installed.
We apologize for any inconvenience caused by this outage. We take the stability of our services seriously and will continue working to improve in every area we can.