On May 1, 2019, the Spaces storage cluster in NYC3 suffered a cascading network interface failure. Data nodes dropped off the network, forcing a large segment of the cluster into data loss protection mode, in which the cluster no longer services writes. As a result, S3 API availability dropped to approximately 20%.
After resetting the network hardware on the data nodes, our engineering team discovered a hardware configuration problem that capped service restoration at 60% of capacity until an on-site datacenter technician could correct the issue.
The impact of this incident was a significant disruption to data availability, with the cluster in a degraded state for approximately 2.5 hours.
11:55 UTC - NYC3 Spaces Storage Cluster observed disruption in network traffic (no loss of availability)
12:01 UTC - The cluster begins recovering from initial network disruption, causing an expected spike in traffic on the backend storage network
12:06 UTC - Initial network hardware lockup detected on one data node
12:06-12:15 UTC - Additional nodes begin observing similar hardware lockup
12:16 UTC - Cluster takes data offline to prevent data loss; availability drops to 20% and the storage engineering team is paged in
12:23 UTC - Storage engineers begin diagnosing the issue and observe locked-up network hardware on multiple data hosts in the cluster, resulting in a full loss of connectivity between data nodes on the backend storage network
12:25-13:25 UTC - Storage engineers begin resetting network hardware to restore connectivity, but observe the hardware locking up again. Engineering disables the faulty hardware on all affected nodes and brings up only one network interface on each data node
13:30 UTC - Cluster recovers data availability back to ~60%, but exposes a hardware misconfiguration on one of the nodes
14:45 UTC - Data center team is paged to correct the hardware configuration on the misconfigured data node
14:45 UTC - Availability returns to 100% and services are fully restored
16:07 UTC - Engineering begins resetting the failed network links and bringing them back online to restore uplink redundancy
17:55 UTC - Cluster completes internal recovery back to good status
19:00 UTC - Additional alerting is put into place to catch any recurrence of the issue before it can cascade to additional nodes
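The new alerting added at 19:00 UTC is intended to catch a network hardware lockup on a single node before it cascades. A minimal sketch of that kind of check is below; the event format, thresholds, and node names are illustrative assumptions, not the actual monitoring pipeline.

```python
from collections import defaultdict

# Hypothetical sketch: flag any node whose backend network link reports
# repeated link-down events within a short window, and page early once
# more than one node is affected (a sign the failure may be cascading).
# Event schema and thresholds here are illustrative assumptions.

def nodes_with_locked_links(events, window_s=300, flap_threshold=3):
    """events: iterable of (timestamp_s, node, state), state 'down' or 'up'.
    Returns the set of nodes whose link went down at least flap_threshold
    times within any window_s-second span."""
    downs = defaultdict(list)
    for ts, node, state in events:
        if state == "down":
            downs[node].append(ts)
    locked = set()
    for node, times in downs.items():
        times.sort()
        # Slide a window of flap_threshold consecutive down events.
        for i in range(len(times) - flap_threshold + 1):
            if times[i + flap_threshold - 1] - times[i] <= window_s:
                locked.add(node)
                break
    return locked

def should_page(locked_nodes, cascade_threshold=2):
    # Page immediately once the lockup spreads past a single node.
    return len(locked_nodes) >= cascade_threshold
```

In this sketch a single flapping node raises a warning, while two or more locked-up nodes trigger an immediate page, mirroring the cascade pattern seen in this incident.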
We have performed a full audit of all systems to ensure there are no other hardware misconfigurations. Along with the aforementioned additional alerting, we are enhancing internal processes and adding automated tooling to periodically examine the cluster configuration. Lastly, we are also working with the vendor to deploy a fix for the network hardware failure that triggered the event.
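The automated tooling mentioned above periodically compares each node's configuration against an expected baseline. A minimal sketch of such an audit follows; the setting names, values, and node names are illustrative assumptions and not the cluster's actual schema.

```python
# Hypothetical sketch of automated configuration auditing: compare each
# data node's reported hardware settings against an expected baseline and
# report any drift, so a misconfiguration is caught before it blocks
# recovery. Field names and values are illustrative assumptions.

EXPECTED = {
    "nic_firmware": "2.4.1",       # assumed firmware version
    "bond_mode": "active-backup",  # assumed NIC bonding mode
    "mtu": 9000,                   # assumed jumbo-frame MTU
}

def audit_node(reported, expected=EXPECTED):
    """Return a list of (setting, expected, actual) tuples for mismatches."""
    drift = []
    for key, want in expected.items():
        got = reported.get(key)
        if got != want:
            drift.append((key, want, got))
    return drift

def audit_cluster(nodes, expected=EXPECTED):
    """nodes: dict of node_name -> reported settings.
    Returns only the nodes that have drifted from the baseline."""
    return {name: d for name, cfg in nodes.items()
            if (d := audit_node(cfg, expected))}
```

Run on a schedule, a report like this turns a silent misconfiguration into an actionable ticket long before an incident exposes it.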
We take the stability of our services seriously and will continue to improve in every area we can. We apologize for the inconvenience caused by this outage.