On September 25, 2024 at 22:25 UTC, DigitalOcean experienced a reduction of datacenter capacity in SFO3 that impacted the availability of select DigitalOcean services. A majority of the line cards on one of our core routers in SFO3 rebooted at the same time, which interrupted inter-regional traffic and caused traffic to the network backbone to drop. This issue impacted users of any DigitalOcean services in the SFO3 region, with a longer impact on select Managed Kubernetes Clusters (DOKS).
Networking
Specific Impact on DOKS
Sep 25 22:21 - Large majority of line cards rebooted on the core router.
Sep 25 22:24 - Line cards came back online.
Sep 25 22:25 - Network protocols started session establishment process.
Sep 25 22:30 - Traffic on the affected core router was restored.
Sep 25 22:50 - SFO3 control plane systems all reconnected and recovered.
Sep 25 23:07 - DOKS API servers degraded.
Sep 25 23:59 - Some DOKS clusters in the SFO3 region could not be scraped. Several nodes were discovered to be in a “not ready” state.
Sep 26 01:40 - All impacted DOKS nodes were recycled and clusters were operational.
DigitalOcean teams are working on multiple types of remediation to help prevent a similar incident from happening in the future.
DigitalOcean is working with the vendor support team for the devices to determine the root cause of the line card crash, and is upgrading software on the core routers in the SFO3 region.
During the incident, engineers had to manually remediate affected nodes across the entire SFO3 DOKS fleet to restore service. Teams are exploring ways to reduce the need for manual action in the future by increasing thresholds for automated remediation actions, so that service is restored as quickly as possible.
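To illustrate why a remediation threshold matters in a fleet-wide event, here is a minimal sketch of a threshold-gated remediation plan. It is not DigitalOcean's implementation: the function `plan_remediation`, the `RemediationPlan` type, the `max_auto_percent` parameter, and the node names are all hypothetical, and the policy (automation may recycle at most a fixed percentage of the fleet per pass) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationPlan:
    auto_recycle: list[str]  # nodes the automation may recycle on its own
    needs_manual: list[str]  # nodes left for an engineer to handle


def plan_remediation(not_ready: list[str], fleet_size: int,
                     max_auto_percent: int) -> RemediationPlan:
    """Split "not ready" nodes into automatic and manual work.

    Hypothetical policy: automation may recycle at most
    max_auto_percent of the fleet in one pass; any node beyond
    that budget is escalated to a human.
    """
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    if not 0 <= max_auto_percent <= 100:
        raise ValueError("max_auto_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding in the budget.
    budget = fleet_size * max_auto_percent // 100
    return RemediationPlan(auto_recycle=not_ready[:budget],
                           needs_manual=not_ready[budget:])


if __name__ == "__main__":
    # Illustrative numbers only: 30 unhealthy nodes in a fleet of 100.
    unhealthy = [f"node-{i}" for i in range(30)]
    for percent in (10, 50):
        plan = plan_remediation(unhealthy, fleet_size=100,
                                max_auto_percent=percent)
        print(percent, len(plan.auto_recycle), len(plan.needs_manual))
```

With a low threshold, most of the unhealthy nodes in this example fall outside the automation budget and wait for manual action. Raising the threshold lets the same automation handle all of them, at the cost of allowing larger automated changes, which is the trade-off a real policy would need to weigh.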