DigitalOcean Services Status

Network Connectivity in SFO3 Region
Incident Report for DigitalOcean
Postmortem

Incident Summary

On September 25, 2024 at 22:25 UTC, DigitalOcean experienced a reduction of datacenter capacity in SFO3 that impacted the availability of select DigitalOcean services. Because a majority of the line cards on one of our core routers in SFO3 rebooted at the same time, inter-regional traffic was interrupted and traffic to the network backbone was dropped. This issue impacted users of any DigitalOcean services in the SFO3 region, with a longer impact on select Managed Kubernetes Clusters (DOKS).

Incident Details

Networking

  • Root Cause: Several line cards rebooted at the same time on one of the core routers, due to hardware errors on the network device.
  • Impact: Datacenter traffic capacity to/from the backbone was reduced by half during this incident. Network connectivity to some DigitalOcean services was affected.
  • Response: All of the crashed line cards came back online quickly, allowing network traffic to begin flowing again and the core router to become operational.

Specific Impact on DOKS

  • Root Cause: Due to the network issues from line card reboots, a number of DOKS fleet machines became unhealthy as guest networking failed to recover from the hardware fault.
  • Impact: Some customer clusters experienced connectivity issues and difficulty in accessing the K8s control plane until networking for the underlying nodes was restored.
  • Response: All affected nodes across the SFO3 region in the DOKS fleet were recycled.

Timeline of Events

Sep 25 22:21 - Large majority of line cards rebooted on the core router.

Sep 25 22:24 - Line cards came back online.

Sep 25 22:25 - Network protocols started session establishment process.

Sep 25 22:30 - Traffic on the affected core router was restored.

Sep 25 22:50 - SFO3 control plane systems all reconnected and recovered. 

Sep 25 23:07 - DOKS API servers degraded.

Sep 25 23:59 - Some DOKS clusters in the SFO3 region could not be scraped. Several nodes were discovered to be in a “not ready” state.

Sep 26 01:40 - All impacted DOKS nodes were recycled and clusters were operational.
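
As a quick check on the impact windows reported in this incident's status updates (22:25 to 22:50 UTC for all SFO3 services, and 22:25 to 01:40 UTC for Managed Kubernetes clusters), the durations can be computed as below. This is only an illustrative sketch: the helper names `utc` and `minutes_between` are hypothetical and not part of any DigitalOcean tooling.

```python
from datetime import datetime, timezone


def utc(month: int, day: int, hour: int, minute: int) -> datetime:
    """Build a timezone-aware UTC timestamp in 2024 (hypothetical helper)."""
    return datetime(2024, month, day, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the incident report.
impact_start = utc(9, 25, 22, 25)   # start of customer-facing impact
regional_end = utc(9, 25, 22, 50)   # SFO3 control plane systems recovered
doks_end = utc(9, 26, 1, 40)        # all impacted DOKS nodes recycled

regional_minutes = minutes_between(impact_start, regional_end)
doks_minutes = minutes_between(impact_start, doks_end)

print(f"Regional impact: {regional_minutes} minutes")
print(f"DOKS impact: {doks_minutes} minutes")
```

Under these timestamps, the regional window is 25 minutes and the DOKS window is 195 minutes (3 hours 15 minutes).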

Remediation Actions

DigitalOcean teams are working on multiple types of remediation to help prevent a similar incident from happening in the future. 

DigitalOcean is working with the vendor support team for the devices to determine the root cause of the line card crash, and is upgrading software on the core routers in the SFO3 region.

During the incident, engineers had to manually remediate affected nodes across the entire SFO3 DOKS fleet to restore service. Teams are exploring ways to reduce the need for manual action in the future by increasing thresholds for automated remediation actions, so that service is restored as quickly as possible.
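
To illustrate what a remediation threshold can look like in general, here is a minimal sketch. It assumes a policy in which nodes are recycled automatically only while the unhealthy share of a fleet stays at or below a configured fraction, and larger events are escalated to engineers. This is an assumption made for illustration, not a description of DigitalOcean's actual system; `plan_remediation`, `RemediationDecision`, and `max_auto_fraction` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RemediationDecision:
    automated: bool          # True if automation may recycle the nodes itself
    nodes: Tuple[str, ...]   # nodes that need recycling, sorted for stable output


def plan_remediation(
    not_ready_nodes: Iterable[str],
    fleet_size: int,
    max_auto_fraction: float,
) -> RemediationDecision:
    """Decide whether unhealthy nodes can be recycled without manual action.

    Hypothetical policy: automate only while the unhealthy fraction of the
    fleet is at or below max_auto_fraction; otherwise escalate to engineers.
    """
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    nodes = tuple(sorted(set(not_ready_nodes)))
    unhealthy_fraction = len(nodes) / fleet_size
    return RemediationDecision(
        automated=unhealthy_fraction <= max_auto_fraction,
        nodes=nodes,
    )


# Example: 3 of 10 nodes are not ready (made-up numbers).
low_threshold = plan_remediation(["node-a", "node-b", "node-c"], 10, 0.2)
high_threshold = plan_remediation(["node-a", "node-b", "node-c"], 10, 0.5)
print(low_threshold.automated, high_threshold.automated)
```

In this sketch, raising `max_auto_fraction` from 0.2 to 0.5 changes the same event from a manual escalation to an automated one. The tradeoff is that a higher threshold gives automation more reach during a large-scale event, so the value would need to be chosen with care.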

Posted Oct 04, 2024 - 17:38 UTC

Resolved
From 22:25 to 22:50 UTC, users may have experienced network connectivity issues with all DigitalOcean services deployed in the SFO3 region. Additionally, from 22:25 until 01:40 UTC, users may have encountered network connectivity issues affecting their Managed Kubernetes clusters in the SFO3 region.

Our Engineering team has fully resolved the issues impacting connectivity in the region, and all services are now operating normally.

If you continue to experience any problems, please open a ticket with our Support team. We apologize for any inconvenience caused.
Posted Sep 26, 2024 - 02:45 UTC
Monitoring
Our Engineering team has implemented a fix to resolve the issue affecting network connectivity with Managed Kubernetes clusters in the SFO3 region and is closely monitoring the situation. We will provide an update as soon as the issue is fully resolved.
Posted Sep 26, 2024 - 02:15 UTC
Identified
Our Engineering team has identified the cause of the network connectivity issues with Managed Kubernetes clusters in the SFO3 region and is actively working on rolling out a fix. Once we have additional information, we will share another update.
Posted Sep 26, 2024 - 01:27 UTC
Update
We are continuing to investigate the network connectivity issue with Managed Kubernetes clusters.

During this investigation, we have also identified an earlier period of network connectivity disruption affecting all services in the region, from 22:25 to 22:50 UTC.

At this time, some Managed Kubernetes clusters may continue experiencing network connectivity issues, but all other services in the region should be accessible as normal.
Posted Sep 26, 2024 - 00:38 UTC
Investigating
Our Engineering team is currently investigating issues with network connectivity for Managed Kubernetes clusters in the SFO3 region. During this time, some users may experience difficulties connecting to their Managed Kubernetes clusters, or connecting to other services from their clusters.

We apologize for the inconvenience and will provide an update as soon as we have more information.
Posted Sep 25, 2024 - 23:34 UTC
This incident affected: Kubernetes (SFO3).