FRA1 Block Storage Issue
Incident Report for DigitalOcean
Postmortem

April 1-2, 2018 Partial Block Storage Outage in FRA1

The Incident

On 2018-04-01 at 7:08 UTC, one of several storage clusters in our FRA1 region suffered a cascading failure. As a result, multiple redundant hosts in the storage cluster suffered an Out Of Memory (OOM) condition and crashed nearly simultaneously. We continue to investigate the root cause of the cascading failure.

Our storage infrastructure is designed to handle multiple drive, host, and rack failures and has a successful track record of handling these types of scenarios. However, in this particular instance, failures occurred across every domain (drive, host, rack) simultaneously and the cluster could no longer sustain the spike in customer I/O.

The recovery process was complicated by very large memory requirements at startup due to the scale of the failure, which impacted over 70% of the cluster, as well as by a kernel networking driver bug that caused hosts to lose connectivity on the replication network and require reboots.

This forced us to bring the hosts up in a carefully orchestrated sequence, one device at a time, waiting for each device to complete recovery before continuing to the next. This was an incredibly time-consuming approach, but one that was necessary to prevent further cascading failures.
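As a rough illustration of this approach, the sketch below shows what a sequential bring-up might look like on a Ceph cluster with systemd-managed OSDs. It is not our actual recovery tooling: the OSD IDs, wait conditions, and sleep intervals are placeholders, and the real procedure also involved manually tracking the memory budget of each host.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: bring Ceph OSDs back one at a time.

Not production tooling. Assumes systemd-managed OSDs and a working
'ceph' CLI; wait conditions and timings are placeholders.
"""
import subprocess
import sys
import time


def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    result = subprocess.run(("ceph",) + args, check=True,
                            capture_output=True, text=True)
    return result.stdout


def osd_is_up(osd_id):
    """Check 'ceph osd dump' for this OSD being marked up."""
    for line in ceph("osd", "dump").splitlines():
        fields = line.split()
        if fields[:2] == [f"osd.{osd_id}", "up"]:
            return True
    return False


def recovery_settled():
    """Treat recovery as settled once no PGs are still peering or activating."""
    status = ceph("pg", "stat")
    return "peering" not in status and "activating" not in status


def bring_up_sequentially(osd_ids):
    # Keep the cluster from rebalancing while OSDs return one by one.
    ceph("osd", "set", "noout")
    ceph("osd", "set", "norebalance")
    try:
        for osd_id in osd_ids:
            subprocess.run(["systemctl", "start", f"ceph-osd@{osd_id}"],
                           check=True)
            while not osd_is_up(osd_id):
                time.sleep(10)
            # Waiting for recovery to settle caps the memory spike each
            # recovering daemon adds to its host before the next one starts.
            while not recovery_settled():
                time.sleep(30)
    finally:
        ceph("osd", "unset", "norebalance")
        ceph("osd", "unset", "noout")


if __name__ == "__main__":
    bring_up_sequentially([int(arg) for arg in sys.argv[1:]])
```

The key design point is the wait after each OSD: letting peering and recovery settle bounds how much additional memory any single host has to absorb at one time.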

Timeline of Events

April 1

  • 07:09 UTC - Multiple storage nodes begin crashing due to an OOM condition
  • 07:13 UTC - Alerts for host and availability issues begin flooding in
  • 07:18 UTC - Storage on-call is paged
  • 07:20 UTC - Storage on-call begins bringing impacted nodes back online
  • 07:25 UTC - Storage nodes start coming back online, but an OOM condition occurs again during the recovery process, stalling forward progress in our recovery efforts
  • 08:15 UTC - Additional Storage Engineers are called to help bring nodes up one at a time
  • 10:49 UTC - As we progress toward recovery, a decision is made to augment memory in the hosts to speed up recovery
  • 11:01 UTC - 15:00 UTC (2018-04-02) - Our Datacenter team performs memory augmentations while another team slowly continues to bring up nodes, manually managing the memory budget of each host. We also begin investigating ways to speed up the manual process and to identify the root cause of the issue

April 2

  • 15:00 UTC - The impacted cluster is back to an operational state and final checks are performed. Engineering continues to investigate the root cause of the initial OOM condition that led to the failure. A decision is made to re-enable the cluster under controlled conditions to see if we can identify the possible cause
  • 15:30 UTC - The impacted cluster is re-enabled and the Engineering team observes immediate and rapid growth in memory consumption
  • 15:32 UTC - Customer I/O is suspended again, but not quickly enough, and a number of nodes are lost. However, we are able to capture the likely cause of the OOM
  • 15:35 UTC - The engineering team immediately begins bringing the nodes back up using tooling developed during the outage to speed things up significantly. We continue to investigate a possible fix for the OOMs
  • Approximately 18:30 UTC - The impacted cluster is back up and operational, but one node experiences a hardware failure
  • 19:16 UTC - The failed host is now operational. Adjustments in configuration are made in an attempt to mitigate the OOM condition based on our observation and analysis of the 15:32 UTC event
  • 19:30 UTC - Customer I/O is re-enabled, and the Engineering team observes the same pattern of activity that caused the OOM at 15:32 UTC, but this time memory usage is holding at normal levels

Future Measures

We are continuing our investigation into the root cause of this incident, and will post an update with additional details as soon as the cause is confirmed. In the meantime, we are taking numerous steps in an effort to prevent similar incidents from happening in the future, including:

  • Reproducing the conditions that we believe led to this event to verify the root cause in our test/staging environments
  • Verifying that our countermeasures will effectively prevent the issue in our test environment
  • Deploying the mitigations to all production regions and clusters to prevent another outage of this type
  • Refining the tooling built during this outage to fully handle automated bring-up under these conditions
  • Augmenting memory for storage nodes that may be at risk
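As a generic illustration of the kind of configuration adjustment referenced above (and attempted at 19:16 UTC on April 2), Ceph exposes runtime tunables that throttle recovery concurrency per OSD, trading recovery speed for lower memory and I/O pressure. The sketch below uses standard Ceph option names, but the values are placeholders and this is not the specific mitigation we deployed:

```python
#!/usr/bin/env python3
"""Illustrative sketch only: throttle Ceph recovery at runtime.

The option names are standard Ceph tunables, but the values are
placeholders; this is not the mitigation actually deployed.
"""
import subprocess

# Hypothetical example values; tune for your own cluster before use.
RECOVERY_THROTTLE = {
    "osd_max_backfills": "1",        # concurrent backfill operations per OSD
    "osd_recovery_max_active": "1",  # concurrent recovery ops per OSD
    "osd_recovery_sleep": "0.1",     # pause (seconds) between recovery ops
}


def apply_runtime_throttle():
    """Inject the throttle into all running OSDs without restarting them."""
    args = " ".join(f"--{key} {value}" for key, value in RECOVERY_THROTTLE.items())
    subprocess.run(["ceph", "tell", "osd.*", "injectargs", args], check=True)


if __name__ == "__main__":
    apply_runtime_throttle()
```

Persisting a change like this would normally be done in ceph.conf rather than injected at runtime; the runtime form is shown only because it mirrors how settings get adjusted mid-incident.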

In Conclusion

We recognize the impact this incident had on your work and business, and sincerely apologize for the frustrations caused by this lengthy outage.

Posted Apr 06, 2018 - 20:24 UTC

Resolved
We have re-enabled customer traffic to the affected cluster in FRA1, and users should no longer experience issues with Block Storage availability. You should now be able to re-attach volumes and resume use of Block Storage on all Droplets in the region. In some cases, it may be necessary to power off your Droplet and then power it back on before reattaching the volume. We appreciate your patience throughout this incident and sincerely apologize for the frustrations. If you continue to experience issues, please open a ticket with our support team.
Posted Apr 02, 2018 - 20:47 UTC
Monitoring
We have re-enabled customer traffic to the affected cluster in FRA1, and users should no longer experience issues with Block Storage availability. We are monitoring the situation and will post another update as soon as we have confirmed full resolution. We appreciate your patience throughout this incident.
Posted Apr 02, 2018 - 19:37 UTC
Update
In our efforts to mitigate the issue impacting Block Storage in FRA1, we attempted to re-enable customer traffic on the affected cluster but were immediately deluged with a large backlog of queued cleanup requests. We altered our approach to cleanup requests and have recovered the cluster to the point where we will soon re-attempt to release to customers, at which time we will provide another update.
Posted Apr 02, 2018 - 18:59 UTC
Update
In our efforts to mitigate the issue impacting Block Storage in FRA1, we attempted to re-enable customer traffic on the affected cluster but were immediately deluged with a large backlog of queued cleanup requests. We are now changing how these cleanup requests are handled in the cluster to enable us to return to normal service. We estimate another hour and a half until we fully recover from the large backlog and are able to try enabling customer access again.
Posted Apr 02, 2018 - 17:00 UTC
Update
We are in the final stages of cluster validation and balancing to resolve the issue impacting Block Storage in FRA1 for some users. Once this activity is complete, we will re-enable customer traffic and anticipate full resolution within the next two hours.
Posted Apr 02, 2018 - 15:13 UTC
Update
Our engineering team has been working around the clock to resolve the issue impacting Block Storage availability in FRA1. We have completed augmenting the memory in the cluster, which has allowed us to return nodes to a healthier state. We are working diligently to get all backend services back online and tested before we re-enable customer workloads. We expect to be able to provide more frequent updates as we work towards complete resolution.
Posted Apr 02, 2018 - 14:04 UTC
Update
Issues with Block Storage availability in FRA1 continue to impact some users. Users may be unable to access data stored on volumes, and may also see delays booting or power cycling Droplets with volumes attached. Users creating new volumes in this region should not experience any issues.

We are working to bring the affected cluster back up with data intact by augmenting memory. This effort should resolve the issue with the affected cluster, and also improve recovery time in the future. We will post an update once the issue has been fully resolved.
Posted Apr 02, 2018 - 10:24 UTC
Update
Issues with Block Storage availability in FRA1 continue to impact some users. Until the issue is resolved, users may not be able to access data stored on volumes, and may also see delays booting or power cycling Droplets with volumes attached. Users creating new volumes in this region should not experience any issues.

We are working to bring the affected cluster back up with data intact by augmenting memory. We anticipate this recovery maintenance will last through the night and will resolve the issue with the affected cluster, as well as improve recovery time in the future. We will post an update in the morning (ET), or sooner if we have new information.
Posted Apr 02, 2018 - 02:20 UTC
Update
The ongoing issue with Block Storage in FRA1 continues to impact some users. Until the issue is resolved, users may not be able to access data stored on volumes, and may also see delays booting or power cycling Droplets with volumes attached. We are working to bring the cluster back up and have blocked all user traffic to the affected cluster. While we make progress towards recovery, we currently do not have an ETA for when all volumes will be available in the region. Resolving this issue is our highest priority and we apologize for the frustrations caused.
Posted Apr 01, 2018 - 23:03 UTC
Update
We have an ongoing incident in FRA1 impacting Block Storage availability for some users. Until the issue is resolved, impacted users may not be able to access data stored on volumes, and may also see delays booting or power cycling Droplets with volumes attached.

Roughly 12 hours ago, one of the Ceph clusters that powers Block Storage in FRA1 encountered a situation which triggered a cascading shutdown of the entire cluster. We are working to bring the cluster back up, but have been slowed by the recovery time for each of the Ceph storage nodes.

Unfortunately, we do not have any ETA for when all volumes will be available in the region, but we are working on this with our highest priority. We apologize for the frustrations caused by this incident and we will be publishing a postmortem once we have the incident fully resolved and all the information fully documented.
Posted Apr 01, 2018 - 20:59 UTC
Update
Our engineering team continues troubleshooting to resolve the issue impacting Block Storage in our FRA1 region, where one cluster is affected. We apologize for the frustrations caused by this outage and will share updates as they become available.
Posted Apr 01, 2018 - 17:26 UTC
Update
Our engineering team has identified the cause of the issue impacting Block Storage in our FRA1 region. One cluster in the FRA1 region is affected, and we continue to work towards a fix for this cluster. We appreciate your patience and will post an update once we have more information.
Posted Apr 01, 2018 - 15:25 UTC
Update
Our engineering team continues to work to resolve the issue with Block Storage in our FRA1 region. We appreciate your patience and will post an update once a fix is in place.
Posted Apr 01, 2018 - 12:44 UTC
Update
Our engineering team continues to work on a fix for the issue with Block Storage in our FRA1 region. We appreciate your patience and will post an update as soon as additional information is available.
Posted Apr 01, 2018 - 11:01 UTC
Identified
Our engineering team has identified the cause of the issue with Block Storage in our FRA1 region and is actively working on a fix. We will post an update as soon as additional information is available.
Posted Apr 01, 2018 - 10:09 UTC
Update
Our engineering team is still investigating the issue with Block Storage in our FRA1 region. We apologize for the inconvenience and will post an update as soon as additional information is available.
Posted Apr 01, 2018 - 09:23 UTC
Update
Our engineering team continues to investigate the issue with Block Storage in our FRA1 region. We appreciate your patience and will post an update as soon as additional information is available.
Posted Apr 01, 2018 - 08:05 UTC
Investigating
Our engineering team is investigating an issue with Block Storage in our FRA1 region. We apologize for the inconvenience and will share an update once we have more information.
Posted Apr 01, 2018 - 07:31 UTC
This incident affected: Services (Block Storage) and Regions (FRA1).