April 1-2, 2018 Partial Block Storage Outage in FRA1
On 2018-04-01 at 07:08 UTC, one of several storage clusters in our FRA1 region suffered a cascading failure. As a result, multiple redundant hosts in the storage cluster hit an Out of Memory (OOM) condition and crashed nearly simultaneously. We continue to investigate the root cause of the cascading failure.
Our storage infrastructure is designed to handle multiple drive, host, and rack failures and has a successful track record of doing so. In this particular instance, however, failures occurred across every failure domain (drive, host, and rack) simultaneously, and the cluster could no longer sustain the spike in customer I/O.
The recovery process was complicated by two factors: very large memory requirements at startup, driven by the scale of the failure (over 70% of the cluster was impacted), and a kernel networking driver bug that caused hosts to lose connectivity on the replication network and require reboots.
These constraints forced us to bring the hosts up in a carefully orchestrated sequence, one device at a time, waiting for each device to complete recovery before continuing to the next. This was an incredibly time-consuming approach, but one that was necessary to prevent further cascading failures.
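For illustration, the bring-up loop amounted to the sketch below. It is a minimal approximation only: the host names, the storage-daemon service, and the storage-cli health check are hypothetical stand-ins for our internal tooling, and the real process involved additional manual checks at each step.

```python
#!/usr/bin/env python3
"""Illustrative sketch of a sequential, one-device-at-a-time bring-up.

All host names, commands, and status checks here are hypothetical; the
actual recovery used internal tooling specific to our storage platform.
"""
import subprocess
import time

HOSTS = ["storage-node-01", "storage-node-02", "storage-node-03"]  # hypothetical
POLL_INTERVAL_SECONDS = 30

def start_storage_daemon(host: str) -> None:
    # Hypothetical command; stands in for starting the storage service on one device.
    subprocess.run(["ssh", host, "systemctl", "start", "storage-daemon"], check=True)

def recovery_complete(host: str) -> bool:
    # Hypothetical health check; returns True once the device reports a healthy state.
    result = subprocess.run(
        ["ssh", host, "storage-cli", "health", "--brief"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and "HEALTHY" in result.stdout

for host in HOSTS:
    start_storage_daemon(host)
    # Wait for this device to finish recovery before touching the next one,
    # so concurrent recoveries cannot exhaust memory and trigger another OOM.
    while not recovery_complete(host):
        time.sleep(POLL_INTERVAL_SECONDS)
```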
Timeline of Events
- 07:09 UTC - Multiple storage nodes begin crashing due to an OOM condition
- 07:13 UTC - Alerts for host and availability issues begin flooding in
- 07:18 UTC - Storage on-call is paged
- 07:20 UTC - Storage on-call begins bringing impacted nodes back online
- 07:25 UTC - Storage nodes start coming back online, but an OOM condition occurs during the recovery process, stalling our recovery efforts
- 08:15 UTC - Additional Storage Engineers are called to help bring nodes up one at a time
- 10:49 UTC - As we progress toward recovery, a decision is made to augment memory in the hosts to speed up recovery
- 11:01 UTC - 15:00 UTC (2018-04-02) - Our Datacenter team performs memory augments while another team slowly continues to bring up nodes, manually managing the memory budget of each host (a sketch of this follows the timeline). We also begin investigating ways to speed up the manual process and to identify the root cause of the issue
- 15:00 UTC - The impacted cluster is back to an operational state and final checks are performed. Engineering continues to investigate the root cause of the initial OOM condition that led to the failure. A decision is made to re-enable the cluster under controlled conditions to try to identify the cause
- 15:30 UTC - The impacted cluster is re-enabled and the Engineering team observes immediate and rapid growth in memory consumption
- 15:32 UTC - Customer I/O is suspended again, but not quickly enough, and a number of nodes are lost. However, we are able to capture the likely cause of the OOM
- 15:35 UTC - The engineering team immediately begins bringing the nodes back up using tooling developed during the outage to speed things up significantly. We continue to investigate a possible fix for the OOMs
- Approximately 18:30 UTC - The impacted cluster is back up and operational, but one node experiences a hardware failure
- 19:16 UTC - The failed host is back online. Configuration adjustments are made in an attempt to mitigate the OOM condition, based on our observation and analysis of the 15:32 UTC event
- 19:30 UTC - Customer I/O is re-enabled, and the Engineering team observes the same pattern of activity that caused the OOM at 15:32 UTC, but this time memory usage is holding at normal levels
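The "manually managing the memory budget" step at 11:01 UTC amounted to checking a host's available memory before allowing the next device to begin recovery. A minimal sketch of that idea, with a hypothetical threshold and a hypothetical remote-access path:

```python
import subprocess

# Hypothetical 8 GiB floor; the real budget was tuned per host.
MIN_FREE_KIB = 8 * 1024 * 1024

def free_memory_kib(host: str) -> int:
    # Read MemAvailable from /proc/meminfo on the remote host (hypothetical access path).
    out = subprocess.run(
        ["ssh", host, "grep", "MemAvailable:", "/proc/meminfo"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.split()[1])

def can_start_next_device(host: str) -> bool:
    # Only recover another device once there is enough headroom to absorb the
    # recovery-time memory spike without risking another OOM.
    return free_memory_kib(host) >= MIN_FREE_KIB
```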
We are continuing our investigation into the root cause of this incident, and will post an update with additional details as soon as the cause is confirmed. In the meantime, we are taking numerous steps in an effort to prevent similar incidents from happening in the future, including:
- Reproducing the conditions that we believe led to this event to verify the root cause in our test/staging environments.
- Verifying that our countermeasures will effectively prevent the issue in our test environment.
- Deploying the mitigations to all production regions and clusters to prevent another outage of this type.
- Refining the tooling built during this outage to fully handle automated bring-up under these conditions.
- Augmenting memory for storage nodes that may be at risk.
We recognize the impact this incident had on your work and business, and sincerely apologize for the frustrations caused by this lengthy outage.