Block Storage Issues Across All Regions
Incident Report for DigitalOcean
Postmortem

The Incident

On Monday, September 30th, 2019, customers of our Volumes Block Storage product experienced an outage between 20:46 UTC and 23:58 UTC. Specifically, the impacted regions were: AMS3, BLR1, FRA1, LON1, NYC3, and TOR1. Some clusters remained unaffected throughout the event, so volumes residing on those clusters continued to be served. Individual volumes may have experienced a shorter outage.

The outage was triggered by a networking configuration change on the Block Storage clusters that was intended to improve the handling of packet loss scenarios. The new setting caused incompatibilities, which led to network interfaces becoming unavailable. Once alerted to the outage, our engineering teams rolled back the configuration change and then reset the network interfaces across all affected cluster nodes.

Because DigitalOcean Kubernetes Service (DOKS) leverages Volumes Block Storage, consumers of the Kubernetes API, including customers and services within the cluster, would also have experienced an outage or degraded access. Among other impacts, this would have prevented the scheduling and modification of workloads in the cluster. Additionally, Kubernetes Persistent Volumes are backed by DigitalOcean Volumes Block Storage, so customer applications using them would have been impacted.

Timeline of Events

20:40 UTC - New networking configuration change goes live in production.

20:54 UTC - Block Storage SLO-based alerts received.

20:54 - 21:05 UTC - Pattern of cluster throughput dips established for many clusters, with the new networking configuration change identified as the likely culprit. Incident channel spun up.

21:08 UTC - DigitalOcean receives customer reports of performance degradation for Volumes attached to Droplets.

21:17 UTC - DigitalOcean Kubernetes cluster availability alerts received: more than 1% of clusters are in a degraded state (Kubernetes API fails to respond).

21:22 UTC - Network configuration change rolled back.

22:29 UTC - BLR1 Block Storage fully recovered.

23:02 UTC - TOR1 Block Storage fully recovered.

23:11 UTC - LON1 Block Storage fully recovered.

23:31 UTC - NYC3 Block Storage fully recovered.

23:43 UTC - AMS3 Block Storage fully recovered.

23:58 UTC - FRA1 Block Storage fully recovered.

23:58 UTC - The Block Storage infrastructure is fully recovered across all regions and status is updated to reflect that Engineering is monitoring the fix.
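
As a rough illustration of the timeline above, the following Python sketch computes how long each region took to recover, measured both from the moment the change went live (20:40 UTC) and from the rollback (21:22 UTC). The timestamps are copied from the timeline; the function and variable names are our own for this example and do not correspond to any DigitalOcean tooling. It assumes all events fall on the same UTC day, which holds for this incident.

```python
from datetime import datetime

FMT = "%H:%M"

# Timestamps (UTC, Sep 30, 2019) copied from the timeline above.
CHANGE_LIVE = "20:40"
ROLLBACK = "21:22"
RECOVERED = {
    "BLR1": "22:29",
    "TOR1": "23:02",
    "LON1": "23:11",
    "NYC3": "23:31",
    "AMS3": "23:43",
    "FRA1": "23:58",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end; both must be on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def recovery_report() -> dict:
    """Map region -> (minutes since change went live, minutes since rollback)."""
    return {
        region: (
            minutes_between(CHANGE_LIVE, recovered_at),
            minutes_between(ROLLBACK, recovered_at),
        )
        for region, recovered_at in RECOVERED.items()
    }


if __name__ == "__main__":
    for region, (since_change, since_rollback) in recovery_report().items():
        print(f"{region}: {since_change} min after change, "
              f"{since_rollback} min after rollback")
```

By this measure, the first region (BLR1) recovered 67 minutes after the rollback and the last (FRA1) 156 minutes after it, which reflects the time needed to reset network interfaces across the affected cluster nodes.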

Future Measures

We are continuing to investigate the incompatibility introduced by the networking configuration change. Additionally, we are exploring improvements to our tools and processes to support a finer-grained, more incremental deployment method for wide, system-level changes.
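
To make the idea of an incremental deployment concrete, here is a minimal Python sketch of a staged rollout: the change is applied to progressively larger batches of clusters, and everything touched so far is rolled back as soon as a batch fails its health check. This is only an illustration of the general technique, not a description of our actual tooling. The function staged_rollout, its callbacks, and the cluster names are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    clusters: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_sizes: Sequence[int] = (1, 2, 4),
) -> List[str]:
    """Apply a change in growing batches.

    Returns the clusters that keep the change. If any cluster in a batch
    is unhealthy after the change, every cluster changed so far is rolled
    back (most recent first) and an empty list is returned.
    """
    changed: List[str] = []
    position = 0
    stage = 0
    while position < len(clusters):
        # Reuse the last stage size once the configured sizes run out.
        size = stage_sizes[min(stage, len(stage_sizes) - 1)]
        batch = clusters[position:position + size]
        for cluster in batch:
            apply_change(cluster)
            changed.append(cluster)
        # A real rollout would wait for a soak period before this check.
        if not all(is_healthy(cluster) for cluster in batch):
            for cluster in reversed(changed):
                rollback(cluster)
            return []
        position += len(batch)
        stage += 1
    return changed


if __name__ == "__main__":
    # Hypothetical cluster names; "c3" simulates a node that breaks.
    broken = {"c3"}
    result = staged_rollout(
        ["c1", "c2", "c3", "c4", "c5"],
        apply_change=lambda c: print(f"apply    {c}"),
        is_healthy=lambda c: c not in broken,
        rollback=lambda c: print(f"rollback {c}"),
    )
    print("clusters with change:", result)
```

In this example the fault is caught after three of the five clusters have been changed, so the remaining two are never touched. The trade-off is a slower rollout in exchange for a smaller blast radius.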

In Conclusion

The stability of our platform and services is incredibly important to us, and we sincerely apologize for the impact and duration of this outage. Customers who have had their Volumes Block Storage impacted by this outage will receive a credit applied towards their next billing cycle.

Posted Oct 04, 2019 - 15:42 UTC

Resolved
Our engineering team has resolved the issues with Block Storage. All systems should now be operating normally. If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Posted Oct 01, 2019 - 03:41 UTC
Update
Our engineering team has resolved the issues with Block Storage. All systems should now be operating normally. If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Posted Oct 01, 2019 - 01:38 UTC
Monitoring
Our engineering team has implemented a fix to resolve the issue with Block Storage and Kubernetes clusters and is monitoring the situation. We will post an update as soon as the issue is fully resolved.
Posted Sep 30, 2019 - 23:59 UTC
Update
Our engineering team has identified an issue with Block Storage across all regions. During this time you may experience higher latency with our Volume Service. We apologize for the inconvenience and will share an update once we have more information and will be sure to update this as individual regions come back online.
Posted Sep 30, 2019 - 23:27 UTC
Update
Our engineering team has identified an issue with Block Storage across all regions. During this time you may experience higher latency with our Volume Service. We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 30, 2019 - 23:04 UTC
Update
Our engineering team has identified an issue with Block Storage across all regions. During this time you may experience higher latency with our Volume Service. We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 30, 2019 - 22:55 UTC
Identified
Our engineering team has identified an issue with Block Storage across all regions. During this time you may experience higher latency with our Volume Service. We apologize for the inconvenience and will share an update once we have more information.
Posted Sep 30, 2019 - 21:25 UTC
This incident affected: Regions (AMS2, AMS3, BLR1, FRA1, LON1, NYC1, NYC2, NYC3, SFO1, SFO2, SGP1, TOR1) and Services (Block Storage).