Network Connectivity in TOR1
Incident Report for DigitalOcean

The Incident

On 2017-12-05 at 22:07 UTC, the Network Engineering team was performing ongoing routine maintenance to upgrade software on switching hardware in our TOR1 region, in an effort to move off of software with known bugs. Part of our upgrade process is to first isolate the switch being upgraded, to limit the impact of any issues that occur during the upgrade. This usually works very well; however, in this incident, isolating the switch triggered a bug that caused neighboring switches to drop their connections to downstream devices. As a result, a subset of Droplets had no external connectivity for approximately 12 minutes. The network recovered following a reboot of the switch.

Timeline of Events

22:07 UTC - Switches are isolated in preparation for upgrade

22:18 UTC - Network issues are detected in TOR1

22:19 UTC - Reboot initiated and connectivity is restored to impacted switches

Future Measures

While this is routine work for our Network Engineering team, we normally perform it during scheduled maintenance windows that have been communicated to customers. We did not follow that best practice in this case, and we are conducting an internal review of our processes to ensure that all necessary steps are taken and communicated before we undertake this type of work in the future. As for the outage itself, the root cause has been resolved by the software upgrade that is now installed.

In Conclusion

We apologize for any inconvenience caused by this outage. We take the stability of our services seriously and will work to improve in all areas where we can.

Posted Dec 08, 2017 - 22:51 UTC

Resolved
Network connectivity in TOR1 has now been restored by our networking team. We really appreciate your patience as we worked through this issue and apologize for the impact this may have had. If you continue to experience any issues, please open a support ticket.
Posted Dec 05, 2017 - 23:20 UTC
Monitoring
We experienced a brief issue with network connectivity in our TOR1 region. This is being monitored by our networking team. We appreciate your patience as we work through this and apologize for any issues this may have caused for you. If you continue to experience any issues, please open a support ticket.
Posted Dec 05, 2017 - 22:40 UTC
This incident affected: Services (Networking).