At approximately 20:33 UTC on August 13, DigitalOcean's NYC3 internet-facing edge routers experienced a simultaneous failure. During this downtime, internal services hosted in NYC3 were inaccessible, causing errors creating, updating, and authenticating resources in several other regions. Customer resources in NYC3 were unable to communicate with the public internet and other data centers.
The edge routers restarted the failed process and began rebuilding routes immediately, which took approximately 15 minutes to complete. Once the routers were back online and network connectivity was restored, engineers began recovering services impacted by the outage.
While internal services were being recovered, network engineers gathered the troubleshooting information from the routers and began engaging with vendor support to diagnose the root cause of the issue. A similar failure occurred on July 28, 2021, that also resulted in simultaneous edge router failures.
After working with the vendor, the root cause of the issue was determined to be a software bug that was encountered when trying to calculate certain routes. The vendor provided a software update that fixed the route calculation bug and was applied in an emergency maintenance.
August 13, 2021
20:33 UTC - Both internet-facing edge routers in the NYC3 data center crash with the same error. Multiple alerts begin paging engineers.
20:38 UTC - A critical incident is created and the Incident Manager begins paging appropriate on-call engineers.
20:39 UTC - Network engineers identify the loss of NYC3 edge routers and begin investigating the failure.
20:43 UTC - Routers finish rebooting and begin serving traffic again.
20:48 UTC - Connectivity between NYC3 edge routers and other data centers is restored and alerts begin resolving.
20:52 UTC - Engineers identify services that have not recovered from the network failure and begin remediation.
21:14 UTC - All services are recovered. Engineers declare the impact as resolved and move the incident to the monitoring phase.
23:20 UTC - Debug logs from the routers are sent to the vendor for root cause analysis.
August 14, 2021
00:43 UTC - Vendor has identified the root cause of the simultaneous crash and begins verifying solutions.
01:24 UTC - A software update is identified that fixes the bug and a maintenance is created to upgrade the Edge routers.
05:00 UTC - Edge router maintenance begins.
07:13 UTC - Edge router maintenance complete.
The affected routers were upgraded at 05:00 UTC on August 14, 2021, to prevent the issue from recurring. While DigitalOcean deploys critical networking infrastructure in redundant and highly available topologies, the bug in route calculation affected all routers in the group, preventing a failover from happening.
Follow-up work is being conducted to improve internal service redundancy for services that took an extended time to recover.
We strive to keep DigitalOcean services available and reliable for our customers who run their businesses and projects on our platform. We apologize for the inconvenience this outage caused and will continue to use the lessons learned to drive improvements in both our systems and processes.