Network Connectivity in SFO3 Region

Postmortem

We post updates on incidents regularly to cover the details of our findings, learnings, and corresponding actions to prevent recurrence. This update covers a recent incident related to network maintenance. Maintenance is an essential part of ensuring service stability at DigitalOcean, through scheduled upgrades, patches, and more. We plan maintenance activities to avoid interruptions, and we recognize the impact that maintenance can have on our customers if there is downtime. We apologize for the disruption that occurred and have identified action items to ensure maintenance activities are planned with minimal downtime and to minimize impact even when unforeseen errors occur. We go into detail about this below.

Incident Summary

On Jan. 7, 2025 at 12:10 UTC, DigitalOcean experienced a loss of network connectivity for Regional External Load Balancers, DOKS, and App Platform products in the SFO3 region. This impact was the result of an unexpected effect of scheduled maintenance work to upgrade our infrastructure and enhance network performance. The network change was designed to be seamless, with the worst expected case being dropped packets. The resolution rollout began on Jan. 7th at 13:15 UTC and was fully completed, with all services reported operational, on Jan. 7th at 14:30 UTC.

Incident Details

The DigitalOcean Networking team began scheduled maintenance to enhance network performance at 10:00 UTC. As part of this maintenance, a routing change was rolled out at 12:10 UTC to redirect traffic to a new path in the datacenter. An old routing configuration present on the core switches was not updated, which resulted in traffic being dropped for products relying on Regional External Load Balancers (e.g., Droplets, DOKS, App Platform). As soon as we detected the drop in traffic, we reverted the change to mitigate the impact.
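
To illustrate the failure mode, the sketch below shows the kind of check a configuration audit could perform: compare the intended next-hop for each prefix against what every core switch actually reports, and flag stale entries like the one that caused this traffic drop. The switch names, prefixes, and query function are hypothetical placeholders, not our actual tooling.

```python
# Minimal illustration with hypothetical names: flag core switches whose
# routing table still points at the old path after a cutover.

INTENDED_ROUTES = {
    # prefix -> next hop it should use after the maintenance
    "10.10.0.0/16": "new-spine-1",
}


def fetch_switch_routes(switch: str) -> dict[str, str]:
    """Placeholder for querying a switch's routing table (e.g. via its API)."""
    if switch == "core-sw-2":
        # Example of a stale entry that never got updated.
        return {"10.10.0.0/16": "old-spine-1"}
    return {"10.10.0.0/16": "new-spine-1"}


def audit(switches: list[str]) -> list[str]:
    """Return a human-readable finding for every route that does not match intent."""
    findings = []
    for sw in switches:
        actual = fetch_switch_routes(sw)
        for prefix, intended in INTENDED_ROUTES.items():
            if actual.get(prefix) != intended:
                findings.append(
                    f"{sw}: {prefix} -> {actual.get(prefix)!r}, expected {intended!r}"
                )
    return findings


if __name__ == "__main__":
    for finding in audit(["core-sw-1", "core-sw-2"]):
        print("STALE ROUTE:", finding)
```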

Timeline of Events

12:12 UTC - Flip to the new workflow was started and the impact began 

12:40 UTC - Alert fired for Load Balancers

13:00 UTC - First report from customers was received

13:13 UTC - Revert of the new code was started

13:21 UTC - Incident was spun up

14:19 UTC - Routing configuration was updated on Core Switches 

14:30 UTC - Code revert was completed and service functionality was restored

Remediation Actions

Following a full internal postmortem, DigitalOcean engineers identified several areas of learning to prevent similar incidents. These include updating maintenance processes, increasing monitoring and alerting, and improving observability of network services. 

Key Learnings:

  1. Learning: We identified gaps in our automated validation of the network configuration for the rollout of this enhancement. While our new, upgraded network results in a simpler state, we have inherited interim complexity as we transition from the old network to the new one.
  • Actions: The gaps in validation automation that resulted in this incident have been addressed. In addition, automated audit processes will be applied to further harden the validation process across all network configurations.

  2. Learning: We identified gaps in our incremental rollout process for network datapath changes, both pre- and post-deployment. The process allowed quicker rollouts for small changes that passed validation, and we recognize that "small changes" can still have a high impact.
  • Actions: All network changes will follow an incremental rollout plan, regardless of whether validation processes determine that a change is minor. Runbooks for this type of maintenance have been updated with reachability tests that cover Load Balancers and Kubernetes clusters (a sketch of such a check follows this list). Additional testing steps in both pre-maintenance and post-maintenance procedures to confirm the success of the maintenance have also been implemented.
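
As a rough illustration of the post-maintenance reachability tests described above, the sketch below performs simple TCP connection checks against Load Balancer and Kubernetes API endpoints and exits non-zero if any are unreachable. The endpoint names and ports are hypothetical examples, not the actual runbook implementation.

```python
# Minimal sketch of a post-maintenance reachability check with placeholder
# endpoints: confirm that Load Balancer VIPs and Kubernetes control planes
# still accept TCP connections, and exit non-zero if any do not.
import socket

ENDPOINTS = [
    ("lb.example.sfo3.internal", 443),        # hypothetical Load Balancer VIP
    ("k8s-api.example.sfo3.internal", 6443),  # hypothetical DOKS control plane
]


def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection; treat any socket error as unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks() -> bool:
    ok = True
    for host, port in ENDPOINTS:
        if reachable(host, port):
            print(f"OK   {host}:{port}")
        else:
            print(f"FAIL {host}:{port} unreachable - halt or roll back the maintenance")
            ok = False
    return ok


if __name__ == "__main__":
    raise SystemExit(0 if run_checks() else 1)
```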
Posted Jan 16, 2025 - 04:58 UTC

Resolved
From 13:09 to 14:48 UTC, we experienced an issue with IPv6 network connectivity in our SFO3 region. Our Engineering team was able to identify the root cause, revert the changes, and resolve the issue.

During that time, connectivity via IPv6 to services in SFO3 was impacted. This included, in particular, connectivity to and provisioning of Load Balancers, control plane connectivity to Kubernetes clusters, and connectivity to App Platform Apps.

If you continue to experience problems, please open a ticket with our support team. Thank you for your patience and we apologize for any inconvenience.
Posted Jan 07, 2025 - 16:43 UTC
Monitoring
Our engineering team has implemented a fix to resolve the networking issue in our SFO3 region. Users should no longer encounter connectivity issues, latency, timeout errors, or problems provisioning Load Balancers while interacting with resources in the SFO3 region.

We will post an update as soon as the issue is fully resolved.
Posted Jan 07, 2025 - 15:51 UTC
Update
Our Engineering team is continuing to investigate the cause of the issue with networking in our SFO3 region and is taking action to mitigate this incident by reverting recent changes. In addition to connectivity issues, users may see failures with provisioning Load Balancers in SFO3.

We are seeing some services recover, but expect to see a few more short periods of network connectivity issues as the reverts progress.

We will continue to share updates as soon as we have new information.
Posted Jan 07, 2025 - 14:47 UTC
Investigating
Our engineering team is investigating an issue related to the SFO3 network maintenance here: https://status.digitalocean.com/incidents/lf0njg8ysj0w

During this time users may experience connectivity issues, latency, and timeout errors while interacting with resources in the SFO3 data center.

We apologize for the inconvenience and will share an update once we have more information.
Posted Jan 07, 2025 - 13:35 UTC
This incident affected: Networking (SFO3).