Postmortem
Summary
Beginning at 09:55 UTC on August 9th, we experienced significant network outages that impacted Internet access for our AMS3, LON1, and SFO2 datacenters. These outages occurred in three waves, each lasting 10-15 minutes with approximately an hour between them. The last wave occurred at around 12:15 UTC. While we were addressing the root cause of this incident, additional network disruptions occurred, spanning a total of 1-2 hours. At 14:23 UTC, we completely mitigated the issue, and the network was declared stable in all datacenters.
UTC Timeline (By Datacenter)
LON1:
09:55 - 10:11: Total Internet networking outage
11:05 - 11:15: Second wave of total Internet networking outage
12:13 - 12:21: Third wave of Internet networking outage, 90% of customer traffic impacted
12:43 - 12:46: Slight loss of traffic as traffic reconverges
12:52 - 12:57: Partial outage, up to 40% of customer traffic impacted
13:29: Upgrades completed in LON1. Site stable
AMS3:
09:50 - 09:53: Partial Internet networking outage, up to 40% of customer traffic impacted
10:58 - 11:01: Partial Internet networking outage, up to 50% of customer traffic impacted
11:30 - 12:05: Partial Internet networking outage, 50% of customer traffic impacted
12:20 - 12:27: Partial Internet networking outage, up to 40% of customer traffic impacted
12:27: Upgrades completed in AMS3. Site stable
SFO2:
09:55 - 10:11: Total Internet networking outage
11:02 - 11:14: Partial Internet networking outage, up to 50% of customer traffic impacted
11:48 - 12:27: Second wave of total Internet networking outage
12:41 - 12:51: Third wave of total Internet networking outage
14:21 - 14:23: Partial networking outage, up to 50% of customer traffic impacted
14:23: Upgrades completed in SFO2. Site stable
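To put the timeline above in perspective, the impact windows can be added up per datacenter. The following is a minimal Python sketch, not an official calculation: the windows are copied from the timeline, the names WINDOWS, minutes, and impacted_minutes are illustrative, and partial and total outages are counted alike, so the result is elapsed time with some impact rather than a measure of severity.

```python
from datetime import datetime

# Impact windows (UTC, August 9th) copied from the timeline above.
# Point-in-time entries such as "Upgrades completed" are left out.
WINDOWS = {
    "LON1": [("09:55", "10:11"), ("11:05", "11:15"), ("12:13", "12:21"),
             ("12:43", "12:46"), ("12:52", "12:57")],
    "AMS3": [("09:50", "09:53"), ("10:58", "11:01"), ("11:30", "12:05"),
             ("12:20", "12:27")],
    "SFO2": [("09:55", "10:11"), ("11:02", "11:14"), ("11:48", "12:27"),
             ("12:41", "12:51"), ("14:21", "14:23")],
}


def minutes(start, end):
    """Length of one same-day window in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def impacted_minutes(windows):
    """Sum the window lengths per datacenter (partial and total alike)."""
    return {
        site: sum(minutes(start, end) for start, end in spans)
        for site, spans in windows.items()
    }


if __name__ == "__main__":
    for site, total in impacted_minutes(WINDOWS).items():
        print(f"{site}: {total} minutes with some impact")
```

With the windows as listed, this gives 42 minutes for LON1, 48 minutes for AMS3, and 79 minutes for SFO2.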
Root Cause
We discovered a bug in the operating system software version running on some of our edge routers.
When the routers received an RPKI-related routing update (more background information on RPKI here: https://help.apnic.net/s/article/Resource-Public-Key-Infrastructure-RPKI), their routing process could crash. Each of the sites was redundantly connected to the Internet via multiple devices. However, all of the devices were running versions of software that had the bug, so when the bug was triggered it caused an impact across the entire Internet edge. The routing processes automatically restarted, but reprocessing and updating the global routing tables was spread out over the next few minutes. As a result, the affected datacenters experienced multiple network outages, each lasting for around 10-15 minutes, as the routing processes recovered.
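For readers unfamiliar with RPKI, the sketch below shows the general idea of route origin validation, in the spirit of RFC 6811: a route announcement is compared against signed authorizations (ROAs) and classified as valid, invalid, or not-found. It is background only and does not represent the vendor bug or our router configuration. The ROA entries, the AS numbers, and the function name validate_origin are hypothetical, and the prefixes come from documentation ranges.

```python
import ipaddress

# Hypothetical ROA entries: (authorized prefix, max prefix length, origin AS).
ROAS = [
    (ipaddress.ip_network("192.0.2.0/24"), 24, 64500),
    (ipaddress.ip_network("198.51.100.0/24"), 25, 64501),
]


def validate_origin(prefix, origin_as, roas):
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_as in roas:
        # A ROA is relevant only if its prefix covers the announced route.
        if route.version != roa_prefix.version:
            continue
        if not route.subnet_of(roa_prefix):
            continue
        covered = True
        # The route matches when it is not more specific than the ROA
        # allows and the origin AS is the authorized one.
        if route.prefixlen <= max_length and origin_as == roa_as:
            return "valid"
    # Covered by some ROA but matched by none: invalid. No covering ROA: not-found.
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    print(validate_origin("192.0.2.0/24", 64500, ROAS))
    print(validate_origin("192.0.2.0/24", 64999, ROAS))
    print(validate_origin("203.0.113.0/24", 64500, ROAS))
```

Routers typically receive this validation data from a separate validator rather than computing it themselves, which is why a change in RPKI data can reach a router as a routing-related update.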
Following an investigation by our vendor, we decided to roll out a software update to the edge routers running the affected version. The emergency upgrade of the network devices was focused on recovering stability, and it caused some additional reachability issues while traffic was shifted off circuits and the edge routers rejoined the Internet.
Once the rollout was complete, the bug was resolved for the long term, and network stability returned across all DigitalOcean datacenters.
Conclusion
Ensuring redundancy is a critical priority, and we continue to invest in our broad infrastructure to provide increasing levels of reliability and performance. We recognize the trust our customers place in us when running workloads on DigitalOcean, and we are focused on earning that trust.
All global edge routers have been upgraded to mitigate the bug and we do not expect any further impact.
We are in the midst of implementing network upgrades that will provide additional improvements to the reliability and security of our network. Locations in Asia have received these upgrades, and Europe and North America will receive them later this year and early in 2024.
Posted Aug 11, 2023 - 23:05 UTC
Resolved
Our Engineering team has confirmed full resolution of the issue impacting network connectivity in multiple regions. The impact has completely subsided and network connectivity is back to normal for all the impacted services. Users should now be able to process events normally for Droplets and Droplet-based services such as Load Balancers, Kubernetes clusters, and Database clusters.
If you continue to experience problems, please open a ticket with our support team from your Cloud Control Panel.
Thank you for your patience and we apologize for any inconvenience.
Posted Aug 09, 2023 - 16:01 UTC
Monitoring
The rollout of the fix to redundant networking equipment is fully completed, meaning all networking devices in AMS3, LON1, and SFO2 have been patched. Our Engineering team saw a brief period where SFO2 was impacted as traffic reconverged.
At this time, all services and regions impacted by this incident should be recovered.
We're now monitoring to ensure stability.
If you are still experiencing any issues, please let our Support team know.
Posted Aug 09, 2023 - 14:58 UTC
Update
Our Engineering team saw congestion on network routes between AMS3 and FRA1, which impacted users with services in those regions, as well as traffic on paths between those regions (including services in LON1). Users would have seen latency and connectivity errors.
Although this congestion is separate from the root cause of this incident, it was exacerbated by the incident.
Our team has taken action to rebalance traffic and is seeing better performance at this time.
Our team is finishing up the rollout to redundant networking equipment and we're continuing to monitor for any customer impact.
Posted Aug 09, 2023 - 13:52 UTC
Update
The fix has now been rolled out for our LON1 region and users should be able to interact with their resources in LON1 normally at this time.
This concludes the rollout and all services in all regions should be recovered or showing recovery.
Our Engineering team is continuing work to roll out this fix to our redundant networking equipment so that, in the case of a failover, we are not susceptible to the issue that this rollout mitigates.
Posted Aug 09, 2023 - 13:27 UTC
Update
The fix has now been rolled out for our SFO2 region and users should be able to interact with their resources in SFO2 normally at this time.
Our team continues to work on the rollout for LON1.
Posted Aug 09, 2023 - 13:08 UTC
Update
Our Engineering team continues to work on a fix for the issue impacting connectivity in multiple regions. At this time, a fix has been rolled out in AMS3 and a fix in LON1 is underway.
Users are still experiencing connectivity issues, latency, and timeout errors while interacting with resources. All Droplet-based services appear to be impacted.
This incident also impacts event processing, where users may experience delays or errors while creating, deleting, and processing other events (such as power off/on and Snapshots) on Droplets and Droplet-based products such as Load Balancers, Kubernetes clusters, and Database clusters.
We will post an update once a fix has gone out for LON1 or we have additional information.
Posted Aug 09, 2023 - 13:03 UTC
Identified
As of 11:00 UTC, we have noticed a recurrence of the issue impacting multiple services. Our Engineering team is actively working to mitigate it.
We will post an update as soon as additional information is available.
Posted Aug 09, 2023 - 11:24 UTC
Monitoring
Our engineering team has confirmed that the issue impacting multiple regions has been mitigated. We are continuing to monitor the situation closely as the functionality of the services recovers.
We will post an update as soon as the issue is fully resolved.
Posted Aug 09, 2023 - 10:58 UTC
Investigating
As of 09:56 UTC, our Engineering team is investigating an issue impacting networking in multiple regions. At this time, we do not have information about the full impact. We will post updates shortly.
We apologize for the inconvenience and will share an update once we have more information.
Posted Aug 09, 2023 - 10:33 UTC
This incident affected: API, Billing, Cloud Control Panel, Cloud Firewall, Community, DNS, Support Center, Reserved IP, WWW, Monitoring (Global, AMS2, AMS3, BLR1, FRA1, LON1, NYC1, NYC2, NYC3, SGP1, SFO1, SFO2, SFO3, SYD1, TOR1), Networking (Global, AMS2, AMS3, BLR1, FRA1, LON1, NYC1, NYC2, NYC3, SFO1, SFO2, SFO3, SGP1, SYD1, TOR1), App Platform (Global, Amsterdam, Bangalore, Frankfurt, London, New York, San Francisco, Singapore, Sydney, Toronto), Event Processing (Global, AMS2, AMS3, BLR1, FRA1, LON1, NYC1, NYC2, NYC3, SFO1, SFO2, SFO3, SGP1, SYD1, TOR1), VPC (Global, AMS2, AMS3, BLR1, FRA1, LON1, NYC1, NYC2, NYC3, SFO1, SFO2, SFO3, SGP1, SYD1, TOR1), Container Registry (AMS3, FRA1, NYC3, SFO3, SGP1, SYD1), Kubernetes (AMS3, BLR1, FRA1, LON1, NYC1, NYC3, SFO2, SFO3, SGP1, SYD1, TOR1), Spaces (AMS3, FRA1, NYC3, SFO3, SGP1, SYD1), Managed Databases (AMS3, BLR1, FRA1, LON1, NYC1, NYC2, NYC3, SFO2, SFO3, SGP1, SYD1, TOR1), Spaces CDN (AMS3, FRA1, NYC3, SFO3, SGP1, SYD1), Load Balancers (AMS2, AMS3, BLR1, FRA1, LON1, NYC1, NYC2, NYC3, SFO1, SFO2, SFO3, SGP1, SYD1, TOR1), Volumes (AMS2, AMS3, BLR1, FRA1, LON1, NYC1, NYC2, NYC3, SFO1, SFO2, SFO3, SGP1, SYD1, TOR1), Droplets (AMS2, AMS3, BLR1, FRA1, LON1, NYC1, NYC2, NYC3, SFO1, SFO2, SFO3, SGP1, SYD1, TOR1), and Functions (AMS3, BLR1, FRA1, LON1, NYC1, SFO3, SGP1, SYD1, TOR1).