Beginning at 23:34 UTC on June 13th, the DigitalOcean NYC (NYC1 and NYC2), FRA1, and TOR1 data center regions experienced a series of three separate Internet connectivity outages. This prevented customer Droplets, including Kubernetes clusters, App Platform, Load Balancers, and Spaces in those data center regions, from accessing the Internet during the outages. Additionally, users were unable to access the Cloud Control Panel and the DigitalOcean API during the first two outages. Customers may also have seen gaps in metrics in Insights during the incident. The incident was triggered by a software bug in the border routers of those data center regions; other regions were running a newer version of the software and were not impacted. The updated software has been deployed to the impacted systems, and the remaining cleanup work has been completed.
Our engineers were able to detect and begin an incident response for this issue within minutes of the first signs of impact. We quickly determined the root cause. However, confounding factors extended the time to full resolution as detailed below.
The root cause was discovered to be a bug in the operating system software of critical edge routing hardware serving four data center regions, which caused those routing systems to crash. The bug was triggered by the routers' interpretation of automatically propagated routes. Redundant routers propagate routing information between one another as quickly as possible so that the cluster converges on a consistent view of the network. This standard routing-protocol behavior, coupled with the bug, caused clustered routers to crash nearly concurrently.
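To illustrate the failure mode described above, the following is a minimal, hypothetical model (not DigitalOcean's actual router software, and the `Router` class and malformed-route trigger are invented for illustration): because each router re-propagates a route to its redundant peers before a parsing bug takes it down, a single bad advertisement can crash every node in the cluster nearly concurrently.

```python
# Hypothetical sketch of the cascade: each router forwards a route to its
# redundant peers *before* a parsing bug crashes it, so one malformed
# advertisement takes down the whole cluster at nearly the same time.

class Router:
    def __init__(self, name):
        self.name = name
        self.peers = []        # redundant cluster members
        self.routes = set()    # routes already learned
        self.crashed = False

    def receive(self, route):
        if self.crashed or route in self.routes:
            return
        self.routes.add(route)
        # Standard behavior: propagate promptly so the cluster converges
        # on a consistent routing table.
        for peer in self.peers:
            peer.receive(route)
        # Hypothetical bug: interpreting this route crashes the routing
        # process, but only after it has already been forwarded.
        if route.startswith("MALFORMED"):
            self.crashed = True

# A fully meshed cluster of redundant edge routers.
cluster = [Router(f"edge-{i}") for i in range(4)]
for r in cluster:
    r.peers = [p for p in cluster if p is not r]

cluster[0].receive("MALFORMED:198.51.100.0/24")
print([r.crashed for r in cluster])  # every node in the cluster has crashed
```

The key detail the model captures is ordering: propagation happens before the crash, so redundancy offers no protection, since every redundant node receives the same poison input.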
Without the routing service, the edge routers quickly degraded and stopped serving traffic. This cut off the DigitalOcean network behind the edge routers in question from the rest of the Internet during these crashes. Besides impact to customers, the lack of external network access complicated efforts to troubleshoot and fix problems and spread secondary impacts to nearly all DigitalOcean services. The core problem occurred in three waves of approximately 27 minutes each.
To fix this issue, DigitalOcean engineers deployed emergency software upgrades to the impacted systems. These upgrades were the most time-consuming part of the response, but, when complete, they resolved the bug in the impacted systems for the long term. Internet access for Cloud, API, Droplets, Managed Kubernetes, App Platform, Insights, Load Balancers, and Spaces products was restored by the end of the emergency software upgrades. Some asynchronous customer-triggered events like power offs and resizes were stuck until DigitalOcean engineers completed manual cleanup of affected events after the incident.
23:34 UTC, June 13 - First set of router crashes began
During the first set of crashes, lasting about 24 minutes, the main network edge routers in NYC1, NYC2, FRA1 and TOR1 data center regions would all eventually fall offline.
23:40 UTC - Response began
Six minutes into the initial impact, DigitalOcean observability engineers spotted signs of a serious network problem and escalated, at about the same time that network alerts lit up across the board.
In response to the escalation, an incident team was organized, slowed somewhat by internal tooling being affected by the outage.
While responders worked to find the root cause of the event and mitigate impacts, failures spread to some other DigitalOcean products due to heavy loads.
23:58 UTC - The first set of router crashes resolved on its own as the border routers self-healed
Alerts indicated that the initial wave of edge router crashes concluded just before midnight UTC.
00:30 UTC, June 14 - Incident responders believed customer impact was resolved
Just over thirty minutes later, the team was confident that the root cause had been identified. Responders believed at this time that the problem was, at least in the short term, resolved by the self-healing of the systems in question, and they decided to monitor the problem while considering long-term solutions.
00:54 UTC - Second set of router crashes began
While additional long-term mitigations were being prepared by the incident responders, the systems unexpectedly crashed again, following the same pattern as the first wave.
01:06 UTC - Emergency upgrades planned
Incident responders decided that the safest and surest solution to the problem was to perform upgrades to the edge routers on an emergency basis right away. This upgrade was previously planned but was waiting on capacity upgrades as discussed below.
01:22 UTC - Second set of router crashes and customer impact resolved via systems self-healing
01:49 UTC - Emergency upgrades began in NYC data center regions
02:50 UTC - Third set of router crashes began
Toward the end of deploying the upgrades to NYC data center regions, the problem once again hit the TOR1 and FRA1 data center regions. This time, the upgraded routers in NYC were safe from the issue, but some DigitalOcean services that serve the NYC regions were still impacted.
03:02 UTC - Emergency upgrades completed in NYC data center regions
After 03:02 UTC, there was no additional direct impact from the bug on customer Droplet connectivity.
The regions outside NYC still saw failures, and other internal systems impacted by the crashes required further manual attention, but the blast radius was reduced by the emergency upgrades.
03:07 UTC - Emergency upgrades began in remaining data center regions
Upgrades to TOR1 and FRA1 began.
03:19 UTC - Third set of router crashes resolved
08:03 UTC - Emergency upgrades deployed and completed on all impacted edge routers
All emergency upgrades in all data center regions were completed by this time.
09:01 UTC - Responders declare the incident resolved
After a period of monitoring to ensure that all internal impact had been resolved and no further customer impact was reported, the incident was declared resolved.
Last year, a similar outage occurred in the NYC3 data center region, and that issue was fixed for the long term via a software upgrade in that region. Most of our other data center regions also received the upgrade proactively at that time. None of the upgraded systems crashed. DigitalOcean engineers have confirmed that this outage was the same issue, spread across the four data center regions where the upgrade was delayed.
In the affected data center regions, this upgrade was delayed to ensure that additional hardware capacity was first deployed so that customer traffic wouldn’t be impacted during a scheduled upgrade activity due to the popularity of these regions. All existing hardware has now been upgraded, and DigitalOcean is in the process of improving our network architecture so that staggering upgrades will not be necessary in the future.
All affected routers were upgraded by 08:03 UTC on June 14, 2022, to prevent the issue from recurring. This incident demonstrated that the software upgrade remediated this issue, since all data center regions on upgraded software were immune to the outage. Additional hardware upgrades and improvements are planned for these and several other core network systems, along with improved designs and procedures. DigitalOcean is confident that these planned improvements will make our systems even more resilient and performant.
DigitalOcean values the trust our customers place in us above all else, and that's why we are committed to providing reliable, best-in-class services to those who use our platform. We understand service interruptions like this can have a real impact on our customers' businesses and projects, and we apologize for the inconvenience of this outage. DigitalOcean is confident that our commitment to excellence and our ongoing process of learning and improvement will drive better outcomes for our users.