Between November 11th and November 20th, 2018, some users of Spaces in our NYC3 datacenter experienced degraded API availability. The availability issues were caused by a combination of events that occurred within this period.
On November 11th, we began increasing capacity for Spaces in NYC3 by adding more drives to the cluster. This augmentation triggered an automatic data movement within the cluster in order to rebalance the load across all the storage processes, which caused us to hit an XFS bug that hung a few storage processes. The cluster’s self-remediation then kicked in to recover the Objects that had reduced data redundancy. However, due to the increased memory pressure on the system, the recovery speed was throttled.
On November 11th, the cluster experienced more storage process hangs, and a decision was made to schedule an emergency maintenance to add more RAM and then reboot the system to clear the stuck processes. On November 12th, we began performing this maintenance node by node. Immediately after completing the RAM augmentation on the first node, we found that nearly all of the storage processes on that node suffered file system corruption following the reboot.
Our engineers restored overall cluster availability to 99.8% in approximately 30 minutes; however, this catastrophic failure left a small portion (0.14%) of the cluster inaccessible to customers. To restore full accessibility, we first had to complete a very time-consuming data extraction process, because the extracted data was essential to recovering the impacted objects (less than 0.01% of the total objects in the cluster).
Meanwhile, our engineers were building the manual recovery tooling, which required substantial engineering effort.
On November 19th, we completed the data extraction, and on November 20th, our Engineering Team restored full accessibility to the cluster. Over the next few days, the team recovered most of the impacted objects using manual recovery tooling. At this time, we also identified the list of unrecoverable objects.
Spaces’ storage infrastructure is designed to handle multiple drive, host, and rack failures, and has a successful track record of managing these types of scenarios in the past. In this instance, however, the interval between the failures was too brief for recovery to restore data redundancy for some objects before the second failure occurred. The impacted objects were corrupted to the point that they could not be repaired by any system-level method; they could only be recovered by extracting the raw data from the filesystem, which contributed to the length of the incident.
10/25 - 11/11
A few storage processes became stuck in kernel space after hitting an XFS bug. Recovery was running but could not catch up due to memory pressure.
1:16 UTC One more storage process hung; a decision was made to schedule a maintenance to augment RAM and clear the hung processes via reboot
17:04 UTC Maintenance began
18:46 UTC After the first node was power cycled to complete the RAM augmentation, most of its storage processes experienced some form of data corruption and could not start. At this point, 0.14% of the cluster was inaccessible to customers
Engineers attempted to repair the corrupted filesystems to restore the down storage processes.
Engineers found a procedure that could recover the hung storage processes without rebooting, and started working with the datacenter team to apply it to the hung processes.
Filesystem-level repair restored a few processes, but the majority of processes remained down
Engineers formed a new recovery plan: extract the raw data from the filesystem, reconstruct the objects, and insert them back into the cluster
Data extraction began
Data extraction continued
Engineers worked on the manual recovery tooling and verified this approach could recover the impacted objects correctly
Data extraction completed
Engineers restored 90% of the inaccessible portion of the cluster
Engineers restored full accessibility to the cluster
Engineers reconstructed the impacted objects using the previously extracted raw data, then re-inserted them into the cluster
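The reconstruct-and-re-insert step above can be sketched in broad strokes. This is a minimal illustration only: the actual tooling was internal, and every name here (`reconstruct_object`, the chunk layout, the use of SHA-256 for integrity) is a hypothetical stand-in.

```python
import hashlib
from pathlib import Path

def reconstruct_object(chunk_paths, expected_sha256):
    """Reassemble an object's payload from raw chunks extracted from the
    filesystem, verifying integrity before it is re-inserted into the cluster.

    Ordering chunks by filename is an assumption made for this sketch.
    """
    data = b"".join(Path(p).read_bytes() for p in sorted(chunk_paths))
    # Refuse to re-insert anything that does not match the expected checksum;
    # a corrupted reconstruction is worse than a missing object.
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"checksum mismatch for reconstructed object: {digest}")
    return data
```

In a real recovery the verified payload would then be written back through the cluster's normal object-insertion path, so that redundancy is re-established by the storage system itself.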
We have made numerous changes as a result of this incident. While rebooting a host is a routine maintenance activity for any storage cluster, in this instance the filesystem corruption after the host reboot was caused by a misconfiguration of the on-drive cache. We have reproduced the same corruption in our testing environment and deployed a periodic check to ensure this configuration is always correct.
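A periodic cache-configuration check of this kind can be sketched as follows. This is an assumption-laden illustration, not the actual check we deployed: it parses the write-caching line reported by `hdparm -W`, and the expected setting and device list would depend on the cluster's durability requirements.

```python
import re
import subprocess

# Hypothetical expected setting for this sketch: 0 = on-drive write cache off,
# so that writes acknowledged by the drive are actually durable.
EXPECTED = "0"

def write_cache_setting(hdparm_output):
    """Parse the write-caching value from `hdparm -W <dev>` output.

    hdparm prints a line of the form ' write-caching =  0 (off)'.
    Returns the digit as a string, or None if no such line is found.
    """
    m = re.search(r"write-caching\s*=\s*(\d)", hdparm_output)
    return m.group(1) if m else None

def check_drive(dev):
    """Return True if the drive's on-board write cache matches EXPECTED."""
    out = subprocess.run(["hdparm", "-W", dev],
                         capture_output=True, text=True).stdout
    return write_cache_setting(out) == EXPECTED
```

Run from cron (or a monitoring agent) across every drive in the cluster, a check like this flags any drive whose cache setting has drifted from the expected configuration before the next reboot can turn that drift into corruption.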
We also identified that the XFS bug which caused storage processes to hang had been fixed in a newer kernel version. We have performed the kernel upgrade in one Spaces region and plan to do the same for NYC3 Spaces in the next few weeks. Additionally, we have developed a procedure that can recover hung storage processes without rebuilding the data or rebooting. Through this incident, we have built a comprehensive toolset that can recover data from the low-level storage backend, and we are refining these tools so they can be leveraged in other manual data recovery scenarios in the future. Finally, we are working on simulating production load with failure injection to discover the system's weaknesses under multiple concurrent failures and address them proactively.
We sincerely apologize for the impact of this lengthy incident on your work and business. We will do everything we can to learn from this event and use it to improve Spaces’ availability in the future.