Networking and Downstream services in NYC1, NYC2, FRA1, TOR1
Incident Report for DigitalOcean
Postmortem

The Incident

Beginning at 23:34 UTC on June 13th, the DigitalOcean NYC1, NYC2, FRA1, and TOR1 data center regions experienced a series of three separate Internet connection outages. This prevented customer Droplets, as well as Kubernetes clusters, App Platform, Load Balancers, and Spaces in those data center regions, from accessing the Internet during the outages. Additionally, users were unable to access the Cloud Control Panel and the DigitalOcean API during the first two outages. Customers may have also seen some gaps in metrics in Insights during the incident. This incident was triggered by a software bug in the border routers of those data center regions. Other regions were running a newer version of the software and were not impacted. The updated software has been deployed to the impacted systems, and the remaining cleanup work has been completed.

Our engineers detected this issue and began an incident response within minutes of the first signs of impact, and we quickly determined the root cause. However, confounding factors extended the time to full resolution, as detailed below.

The root cause was a bug in the operating system software of the critical edge routing hardware serving these four data center regions, which caused those routing systems to crash. The bug was triggered by the interpretation of automatically propagated routes. Routers propagate routing information between redundant nodes as efficiently as possible to avoid errors, and this standard routing protocol behavior, coupled with the bug, caused the clustered routers to crash nearly concurrently.
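To illustrate this failure mode in the abstract, below is a minimal Python sketch of a cluster of redundant routers that flood route updates to their peers. The router names, the update format, and the install() bug are all hypothetical and are not drawn from DigitalOcean's actual router software; the sketch only shows how a single problematic update, propagated between redundant nodes by design, can take every node in a cluster down almost simultaneously.

    # Hypothetical model of redundant edge routers that flood route updates
    # to one another. One update that trips a bug during interpretation
    # crashes every node in the cluster nearly concurrently, mirroring the
    # failure pattern described above. Names and formats are illustrative only.

    class RouterCrash(Exception):
        """Raised when the simulated OS bug is triggered."""

    class Router:
        def __init__(self, name, cluster):
            self.name = name
            self.cluster = cluster      # redundant peers in the same edge cluster
            self.alive = True
            self.routes = set()

        def install(self, update):
            # Hypothetical bug: a particular attribute crashes the routing
            # process during interpretation instead of being rejected.
            if update.get("attr") == "unexpected":
                raise RouterCrash(f"{self.name}: routing process crashed")

        def receive(self, update):
            if not self.alive or update["prefix"] in self.routes:
                return
            self.routes.add(update["prefix"])
            # Standard behavior: flood new routing information to redundant
            # peers before interpreting it locally.
            for peer in self.cluster:
                if peer is not self:
                    peer.receive(update)
            try:
                self.install(update)    # the interpretation step where the bug lives
            except RouterCrash as err:
                self.alive = False
                print(err)

    cluster = []
    cluster.extend(Router(f"edge-{i}", cluster) for i in range(3))

    # One bad update entering any node is flooded to its redundant peers,
    # so the whole cluster fails within the same propagation round.
    cluster[0].receive({"prefix": "203.0.113.0/24", "attr": "unexpected"})
    print([r.alive for r in cluster])   # -> [False, False, False]

In practice the propagation happens over standard routing protocol sessions rather than direct method calls, but the shape of the failure is the same: because redundant nodes share routing information by design, a single trigger reaches all of them within moments.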

Without the routing service, the edge routers quickly degraded and stopped serving traffic. This cut off the DigitalOcean network behind the edge routers in question from the rest of the Internet during these crashes. Beyond the direct impact to customers, the lack of external network access complicated efforts to troubleshoot and fix problems and spread secondary impacts to nearly all DigitalOcean services. The core problem occurred in three waves of approximately 27 minutes each.

To fix this issue, DigitalOcean engineers deployed emergency software upgrades to the impacted systems. These upgrades were the most time-consuming part of the response, but, when complete, they resolved the bug in the impacted systems for the long term. Internet access for Cloud, API, Droplets, Managed Kubernetes, App Platform, Insights, Load Balancers, and Spaces products was restored by the end of the emergency software upgrades. Some asynchronous customer-triggered events, such as power-offs and resizes, remained stuck until DigitalOcean engineers completed manual cleanup of the affected events after the incident.

Timeline of Events

23:34 UTC, June 13 - First set of router crashes began

During the first set of crashes, which lasted about 24 minutes, the main network edge routers in the NYC1, NYC2, FRA1, and TOR1 data center regions all eventually went offline.

23:40 UTC - Response began

Six minutes into the initial impact, DigitalOcean observability engineers spotted signs of a serious network problem and escalated them, at about the same time that network alerts lit up across the board.

In response to the escalation, an incident team was organized, though this was slowed somewhat by the outage's effect on internal tooling.

While responders worked to find the root cause of the event and mitigate its impact, failures spread to some other DigitalOcean products due to heavy load.

23:58 UTC - The first set of router crashes resolved on its own as the border routers self-healed

Alerts indicated that the initial wave of edge router crashes concluded just before midnight UTC.

00:30 UTC, June 14 - Incident responders believed customer impact was resolved

Just over thirty minutes later, the team was confident that the root cause had been identified. Responders believed at this time that the problem was, at least in the short term, resolved by the systems in question self-healing, and they decided to monitor the situation while considering long-term solutions.

00:54 UTC - Second set of router crashes began

While additional long-term mitigations were being prepared by the incident responders, the systems unexpectedly crashed again, following the same pattern as the first wave.

01:06 UTC - Emergency upgrades planned

Incident responders decided that the safest and surest solution to the problem was to perform upgrades to the edge routers on an emergency basis right away. This upgrade was previously planned but was waiting on capacity upgrades as discussed below.

01:22 UTC - Second set of router crashes and customer impact resolved via systems self-healing

01:49 UTC - Emergency upgrades began in NYC data center regions

02:50 UTC - Third set of router crashes began

Toward the end of deploying the upgrades to NYC data center regions, the problem once again hit the TOR1 and FRA1 data center regions. This time, the upgraded routers in NYC were safe from the issue, but some DigitalOcean services that serve the NYC regions were still impacted.

03:02 UTC - Emergency upgrades completed in NYC data center regions

After 03:02 UTC, there was no additional direct impact from the bug on customer Droplet connectivity.

The regions outside NYC still saw failures, and other internal systems impacted by the crashes required further manual attention, but the blast radius was reduced by the emergency upgrades. 

03:07 UTC - Emergency upgrades began in remaining data center regions

Upgrades to TOR1 and FRA1 began.

03:19 UTC - Third set of router crashes resolved

08:03 UTC - Emergency upgrades deployed and completed on all impacted edge routers

All emergency upgrades in all data center regions were complete by this time.

09:01 UTC - Responders declared the incident resolved

After a period of monitoring to ensure that all internal impact had been resolved and no further customer impact was reported, the incident was declared resolved.

Discussion

Last year, a similar outage occurred in the NYC3 data center region, and that issue was fixed for the long term via a software upgrade in that region. Most of our other data center regions also received the upgrade proactively at that time, and none of the upgraded systems crashed. DigitalOcean engineers have confirmed that this outage was the same issue, spread across the four data center regions where the upgrade had been delayed.

In the affected data center regions, the upgrade had been delayed until additional hardware capacity was deployed, because the popularity of these regions meant customer traffic could otherwise have been impacted during a scheduled upgrade. All existing hardware has now been upgraded, and DigitalOcean is improving our network architecture so that staggering upgrades in this way will not be necessary in the future.

Future Measures

All affected routers were upgraded by 08:03 UTC on June 14, 2022, to prevent the issue from recurring. This incident demonstrated that the software upgrade remediates the issue, since all data center regions running the upgraded software were immune to the outage. Additional hardware upgrades and improvements are planned for these and several other core network systems, along with improved designs and procedures. DigitalOcean is confident that these planned improvements will make our systems even more resilient and performant.

In Conclusion

DigitalOcean values the trust our customers place in us above all else, and that is why we are committed to providing reliable, best-in-class services to those who use our platform. We understand that service interruptions like this can have a real impact on our customers' businesses and projects, and we apologize for the inconvenience of this outage. DigitalOcean is confident that our commitment to excellence and our ongoing process of learning and improvement will drive better outcomes for our users.

Posted Jul 11, 2022 - 23:03 UTC

Resolved
Our Engineering team has resolved the issue causing failures of multiple services in NYC1, NYC2, FRA1, and TOR1. All services and Networking should now be operating normally.

If you continue to experience problems, please open a ticket with our support team from within your Cloud Control Panel. Thank you for your patience and we apologize for any inconvenience.
Posted Jun 14, 2022 - 09:05 UTC
Monitoring
Our Engineering team has implemented a fix for the issue impacting multiple services in NYC1, NYC2, FRA1, TOR1. Our team completed the Networking maintenance in these regions and is currently monitoring the situation.

Thank you for your patience, and we will post an update as soon as the issue is fully resolved.
Posted Jun 14, 2022 - 08:30 UTC
Update
Our team is continuing work to complete maintenance in NYC2 and FRA1. Maintenance in NYC1 and TOR1 is complete. We are not currently seeing any impact from the identified root cause, but some impact, as mentioned in previous updates, is still possible.

We are working as quickly as possible and will provide updates as necessary. Thank you.
Posted Jun 14, 2022 - 07:04 UTC
Update
Our team is continuing work to complete maintenance in TOR1 and NYC2 and is starting work on FRA1. NYC1 is complete. We are not currently seeing any impact from the identified root cause, but some impact, as mentioned in the previous update, is still possible.

We will share an update as soon as we have more information.
Posted Jun 14, 2022 - 05:21 UTC
Update
Our team is continuing work to complete maintenance in TOR1. NYC1 is complete. We are not currently seeing any impact from the identified root cause, but until all regions have undergone emergency maintenance, recurrences of the networking and downstream service issue are possible in regions which have not had maintenance completed.

We are working as quickly as possible and will provide updates as necessary. Thank you.
Posted Jun 14, 2022 - 04:02 UTC
Update
Our team has completed maintenance in NYC1 at this time. We are seeing a recurrence of the same issue in NYC2, TOR1, and FRA1, due to the same root cause, and users may experience errors as noted in previous updates. We have confirmed that the completed maintenance prevented the issue from recurring in NYC1, so NYC1 is not impacted. Our team is now working on maintenance for TOR1.
Posted Jun 14, 2022 - 03:10 UTC
Update
Our Engineering team is now beginning emergency maintenance in NYC1, NYC2, FRA1, and TOR1 to implement a long-term fix for the multiple issues referenced in this incident. As this work progresses, we do not expect user impact to networking connectivity with regards to Droplet and Droplet-based products (such as Managed Kubernetes and Databases).

We do not have a firm ETA on completing maintenance but will post updates as we have more information.
Posted Jun 14, 2022 - 02:12 UTC
Update
As of 00:56 UTC, we saw a recurrence of this incident, impacting users as described in our previous updates. We are now seeing recovery again for services, and the root cause has been identified.

Our Engineering team is working to implement a long-term fix to ensure continued stability. Thank you for your patience and we will post another update as soon as we have further details.
Posted Jun 14, 2022 - 01:33 UTC
Update
Our Engineering team is currently investigating a recurrence of this issue and users may once again see issues with multiple products and services as noted in our last updates.

We apologize and will provide another update as soon as we have more information.
Posted Jun 14, 2022 - 01:07 UTC
Identified
Our Engineering team has identified the root cause of the multiple failures and a fix has been implemented. To add to our earlier update, we've confirmed this incident impacted Droplet networking as well and users may have experienced packet loss/latency and loss of connectivity with Droplets in NYC1, NYC2, FRA1, and TOR1, including Managed Kubernetes.

At this time, users should see error rates subsiding and normal connectivity returning.

Thank you for your patience.
Posted Jun 14, 2022 - 00:33 UTC
Investigating
As of approximately 23:40 UTC, our Engineering team is investigating multiple failures with DigitalOcean services, including our Cloud Control Panel, API, events such as creation of all resources, and WWW endpoints failing to resolve. We are starting to see recovery but are still investigating root cause and users may continue to experience elevated errors.

We will post another update as soon as further information is available.
Posted Jun 14, 2022 - 00:00 UTC
This incident affected: Regions (Global, FRA1, NYC1, NYC2, TOR1) and Services (API, App Platform, Cloud Control Panel, DNS, Droplets, Event Processing, Kubernetes, Managed Databases, Networking, WWW, Functions).