On Monday, September 30th, 2019, customers of our Volumes Block Storage product experienced an outage between 20:46 UTC and 23:58 UTC. Specifically, the impacted regions were AMS3, BLR1, FRA1, LON1, NYC3, and TOR1. Some clusters remained unaffected throughout the event, which allowed partial service for volumes residing on those clusters, and individual volumes may have experienced a shorter outage.
The outage was triggered by a networking configuration change on the Block Storage clusters intended to improve the handling of packet-loss scenarios. The new setting introduced an incompatibility that caused network interfaces to become unavailable. Once alerted to the outage, our engineering teams rolled back the configuration change and then reset the network interfaces across all affected cluster nodes.
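For illustration, a remediation of this shape can be sketched as follows. This is a minimal sketch, not DigitalOcean's actual tooling: the node inventory, the sysctl key and value, and the interface name are all hypothetical stand-ins for the real configuration change and reset procedure.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the remediation: revert a kernel networking
setting on each affected node, then reset its network interface."""
import subprocess

AFFECTED_NODES = ["storage-node-01", "storage-node-02"]  # hypothetical inventory
PREVIOUS_SETTING = "net.ipv4.tcp_retries2=15"            # hypothetical sysctl key/value
STORAGE_INTERFACE = "eth1"                               # hypothetical storage NIC

def run_on_node(node: str, command: str) -> None:
    """Run a command on a node over SSH, raising if it fails."""
    subprocess.run(["ssh", node, command], check=True)

for node in AFFECTED_NODES:
    # Roll the networking setting back to its previous value.
    run_on_node(node, f"sudo sysctl -w {PREVIOUS_SETTING}")
    # Reset the interface so it recovers from the unavailable state.
    run_on_node(node, f"sudo ip link set {STORAGE_INTERFACE} down && "
                      f"sudo ip link set {STORAGE_INTERFACE} up")
    print(f"{node}: setting reverted, {STORAGE_INTERFACE} reset")
```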
Because DigitalOcean Kubernetes Service (DOKS) leverages Volumes Block Storage, consumers of the Kubernetes API, including customers and services within the cluster, would also have experienced an outage or degraded access. Among other impacts, this would have prevented the scheduling and modification of workloads in the cluster. Additionally, Kubernetes Persistent Volumes are backed by DigitalOcean Volumes Block Storage, so customer applications using Persistent Volumes would have been impacted as well.
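As a concrete way to see which workloads this touched, the sketch below lists Persistent Volumes provisioned by the DigitalOcean Block Storage CSI driver in a cluster. It assumes the official `kubernetes` Python client and a reachable kubeconfig; the driver name matches the public DigitalOcean CSI driver, but treat the snippet as illustrative rather than an official diagnostic.

```python
"""Hypothetical sketch: list Persistent Volumes in a DOKS cluster that are
backed by DigitalOcean Block Storage via the CSI driver."""
from kubernetes import client, config

DO_CSI_DRIVER = "dobs.csi.digitalocean.com"  # DigitalOcean Block Storage CSI driver

config.load_kube_config()  # assumes a local kubeconfig pointing at the cluster
v1 = client.CoreV1Api()

for pv in v1.list_persistent_volume().items:
    csi = pv.spec.csi
    if csi and csi.driver == DO_CSI_DRIVER:
        # Any workload bound to these volumes would have seen degraded or
        # blocked I/O during the Block Storage outage.
        size = pv.spec.capacity.get("storage") if pv.spec.capacity else "?"
        print(pv.metadata.name, size, pv.status.phase)
```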
20:40 UTC - New networking configuration change is deployed to production.
20:54 UTC - Block Storage SLO-based alerts are received.
20:54 - 21:05 UTC - A pattern of throughput dips is established across many clusters, with the new networking configuration change identified as the likely culprit. An incident channel is spun up.
21:08 UTC - DigitalOcean receives customer reports of performance degradation for Volumes attached to Droplets.
21:17 UTC - DigitalOcean Kubernetes cluster availability alerts fire: more than 1% of clusters are in a degraded state (the Kubernetes API fails to respond). A simplified version of this check is sketched after the timeline.
21:22 UTC - Network configuration change rolled back.
22:29 UTC - BLR1 Block Storage fully recovered.
23:02 UTC - TOR1 Block Storage fully recovered.
23:11 UTC - LON1 Block Storage fully recovered.
23:31 UTC - NYC3 Block Storage fully recovered.
23:43 UTC - AMS3 Block Storage fully recovered.
23:58 UTC - FRA1 Block Storage fully recovered.
23:58 UTC - The Block Storage infrastructure is fully recovered across all regions, and the status page is updated to reflect that Engineering is monitoring the fix.
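The cluster availability alert at 21:17 UTC can be pictured with a simplified check like the one below: probe each cluster's Kubernetes API and alert when the degraded fraction crosses 1%. The endpoints, the probe, and the alerting path are hypothetical; only the threshold comes from the timeline above.

```python
"""Hypothetical sketch of the cluster availability alert: probe each
cluster's Kubernetes API and flag when > 1% of clusters are degraded."""
import requests

DEGRADED_THRESHOLD = 0.01  # threshold from the 21:17 UTC alert above

# Hypothetical per-cluster Kubernetes API endpoints.
CLUSTER_ENDPOINTS = [
    "https://cluster-a.example.com:6443",
    "https://cluster-b.example.com:6443",
]

def api_is_responsive(endpoint: str) -> bool:
    """Probe the cluster's API health endpoint; any error counts as degraded."""
    try:
        return requests.get(f"{endpoint}/healthz", timeout=5).ok
    except requests.RequestException:
        return False

degraded = [ep for ep in CLUSTER_ENDPOINTS if not api_is_responsive(ep)]
if len(degraded) / len(CLUSTER_ENDPOINTS) > DEGRADED_THRESHOLD:
    print(f"ALERT: {len(degraded)}/{len(CLUSTER_ENDPOINTS)} clusters degraded: {degraded}")
```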
Efforts continue to identify the root cause of the incompatibility introduced by the networking configuration change. Additionally, we are exploring improvements to our tools and processes to enable a finer-grained, more incremental deployment method for wide, system-level changes.
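As a rough illustration of what a finer-grained rollout could look like, the sketch below applies a change to a small batch of clusters at a time, lets SLO metrics soak, and halts at the first sign of regression. The helper hooks (`apply_change`, `is_healthy`) are hypothetical placeholders, not our deployment system.

```python
"""Hypothetical sketch of a finer-grained rollout: apply a system-level
change in small batches, soak, and halt on any health regression."""
import time
from typing import Callable, Sequence

def rollout_in_batches(
    clusters: Sequence[str],
    apply_change: Callable[[str], None],  # hypothetical deploy hook
    is_healthy: Callable[[str], bool],    # hypothetical SLO health check
    batch_size: int = 2,
    soak_seconds: int = 600,
) -> None:
    """Apply the change batch by batch, stopping at the first unhealthy batch."""
    for i in range(0, len(clusters), batch_size):
        batch = clusters[i:i + batch_size]
        for cluster in batch:
            apply_change(cluster)
        time.sleep(soak_seconds)  # let SLO metrics accumulate before judging
        if not all(is_healthy(c) for c in batch):
            raise RuntimeError(f"Health regression in batch {batch}; rollout halted")
```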
The stability of our platform and services is incredibly important to us, and we sincerely apologize for the impact and duration of this outage. Customers whose Volumes Block Storage was impacted by this outage will receive a credit applied toward their next billing cycle.