Network Outage in NYC3
Incident Report for DigitalOcean
Postmortem

The Incident

At approximately 20:33 UTC on August 13, 2021, both of DigitalOcean's internet-facing edge routers in NYC3 failed simultaneously. While they were down, internal services hosted in NYC3 were inaccessible, which caused errors when creating, updating, and authenticating resources in several other regions. Customer resources in NYC3 were unable to communicate with the public internet or with other data centers.

The edge routers restarted the failed process and began rebuilding routes immediately, which took approximately 15 minutes to complete. Once the routers were back online and network connectivity was restored, engineers began recovering services impacted by the outage.

While internal services were being recovered, network engineers gathered troubleshooting information from the routers and engaged vendor support to diagnose the root cause of the issue. A similar failure on July 28, 2021, had also resulted in simultaneous edge router failures.

Working with the vendor, engineers determined that the root cause was a software bug triggered when calculating certain routes. The vendor provided a software update that fixed the route calculation bug, and the update was applied during an emergency maintenance.

Timeline of Events

August 13, 2021

20:33 UTC - Both internet-facing edge routers in the NYC3 data center crash with the same error. Multiple alerts begin paging engineers.

20:38 UTC - A critical incident is created and the Incident Manager begins paging appropriate on-call engineers.

20:39 UTC - Network engineers identify the loss of NYC3 edge routers and begin investigating the failure.

20:43 UTC - Routers finish rebooting and begin serving traffic again.

20:48 UTC - Connectivity between NYC3 edge routers and other data centers is restored and alerts begin resolving.

20:52 UTC - Engineers identify services that have not recovered from the network failure and begin remediation.

21:14 UTC - All services are recovered. Engineers declare the impact as resolved and move the incident to the monitoring phase.

23:20 UTC - Debug logs from the routers are sent to the vendor for root cause analysis.

August 14, 2021

00:43 UTC - Vendor has identified the root cause of the simultaneous crash and begins verifying solutions.

01:24 UTC - A software update is identified that fixes the bug and a maintenance is created to upgrade the edge routers.

05:00 UTC - Edge router maintenance begins.

07:13 UTC - Edge router maintenance completes.
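
The intervals implied by this timeline can be checked with a few lines of Python. This is only an illustrative sketch: the timestamps are copied from the entries above, while the event labels and helper names are assumptions made for the example and do not come from DigitalOcean's tooling.

```python
from datetime import datetime, timezone

# Timestamps copied from the timeline above (all UTC).
# The dictionary keys are hypothetical labels chosen for this example.
EVENTS = {
    "routers_crash": "2021-08-13 20:33",
    "incident_created": "2021-08-13 20:38",
    "routers_serving": "2021-08-13 20:43",
    "connectivity_restored": "2021-08-13 20:48",
    "services_recovered": "2021-08-13 21:14",
    "maintenance_begins": "2021-08-14 05:00",
    "maintenance_complete": "2021-08-14 07:13",
}


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two labelled timeline events."""
    delta = parse_utc(EVENTS[end]) - parse_utc(EVENTS[start])
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    # Crash to restored connectivity: the "approximately 15 minutes" in the summary.
    print(minutes_between("routers_crash", "connectivity_restored"))  # 15
    # Crash to all services recovered.
    print(minutes_between("routers_crash", "services_recovered"))  # 41
    # Length of the emergency maintenance.
    print(minutes_between("maintenance_begins", "maintenance_complete"))  # 133
```

Under these timestamps, connectivity was lost for 15 minutes, full service recovery took 41 minutes from the initial crash, and the emergency maintenance lasted 2 hours 13 minutes.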

Future Measures

The affected routers were upgraded during the emergency maintenance that began at 05:00 UTC on August 14, 2021, to prevent the issue from recurring. Although DigitalOcean deploys critical networking infrastructure in redundant, highly available topologies, the route calculation bug affected every router in the group at the same time, which prevented a failover from happening.

Follow-up work is underway to improve redundancy for the internal services that took an extended time to recover.

In Conclusion

We strive to keep DigitalOcean services available and reliable for our customers who run their businesses and projects on our platform. We apologize for the inconvenience this outage caused and will continue to use the lessons learned to drive improvements in both our systems and processes.

Posted Aug 20, 2021 - 16:18 UTC

Resolved
Our Engineering team has resolved the issue with networking in our NYC3 region. Networking service should now be operating normally. If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Posted Aug 14, 2021 - 07:43 UTC
Monitoring
Our Engineering team has implemented a fix for the issues that caused an impact on multiple products. We are continuing to monitor the situation closely. We apologize for the inconvenience and will share another update once the matter is fully resolved.
Posted Aug 13, 2021 - 21:19 UTC
Identified
Our Engineering team has identified the cause of the networking issue with our NYC3 region and is actively working on a fix.
We will post an update as soon as possible.
Posted Aug 13, 2021 - 20:58 UTC
Investigating
Our Engineering team is currently investigating an issue impacting multiple products. During this time, users may experience errors when interacting with Cloud Control Panel, API, and our Community platform. Users may see issues when accessing the Container Registry systems as well. There may be authentication issues for users with Managed Kubernetes clusters. Users may see connectivity issues with their Droplets in the NYC3 region as well. We apologize for any inconvenience and will share more information as soon as it's available.
Posted Aug 13, 2021 - 20:51 UTC
This incident affected: Regions (NYC3) and Services (Networking).