NYC3 Spaces and CDN Performance Issues
Incident Report for DigitalOcean
Postmortem

The Incident

On May 1, 2019, the Spaces storage cluster in NYC3 suffered a cascading network interface failure. The failure caused data nodes to drop off the network, forcing a large segment of the cluster into data-loss-protection mode (in which the cluster no longer serviced writes). This dropped S3 API availability to approximately 20%.

After resetting the network hardware on the data nodes, our engineering team discovered a hardware configuration problem that blocked restoration of service above 60% of capacity until an on-site intervention by a datacenter technician could correct the issue.

The impact of this incident was a significant disruption to data availability: the cluster was in a degraded state for approximately 2.5 hours, from the availability drop at 12:16 UTC until full restoration at 14:45 UTC.

Timeline of Events

11:55 UTC - The NYC3 Spaces storage cluster observes a disruption in network traffic (no loss of availability)

12:01 UTC - The cluster begins recovering from the initial network disruption, causing an expected spike in traffic on the backend storage network

12:06 UTC - An initial network hardware lockup is detected on one data node

12:06-12:15 UTC - Additional nodes begin exhibiting similar hardware lockups

12:16 UTC - The cluster takes data offline to prevent data loss; availability drops to 20% and the storage engineering team is paged

12:23 UTC - Storage engineers begin diagnosing the issue and observe locked-up network hardware on multiple data hosts in the cluster, resulting in a full loss of connectivity between data nodes on the backend storage network

12:25-13:25 UTC - Storage engineers begin resetting network hardware to restore connectivity, but observe the hardware locking up again. Engineering disables the faulty hardware on all nodes and brings up only one network interface on each data node

13:30 UTC - The cluster recovers data availability to ~60%; the partial recovery exposes a hardware misconfiguration on one of the nodes

14:45 UTC - The datacenter team is paged to correct the hardware configuration on the misconfigured data node

14:45 UTC - Availability returns to 100% and services are fully restored

16:07 UTC - Engineering begins resetting the failed network links, bringing them back online, and restoring uplink redundancy

17:55 UTC - The cluster completes internal recovery and returns to a healthy status

19:00 UTC - Additional alerting is put in place to catch any recurrence of the issue before it can cascade to additional nodes (see the sketch below)
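
The postmortem does not spell out what the new alerting checks. Purely as an illustration, a minimal per-host watchdog might poll link state and error counters on the backend storage interfaces and page on-call before a single NIC lockup can cascade to other nodes. Everything here (the interface names, the threshold, and the page_oncall hook) is an assumption for the sketch, not DigitalOcean's actual tooling; it assumes a Linux host, where /sys/class/net exposes per-interface link state and error counters:

```python
import time
from pathlib import Path

# Hypothetical sketch: interface names, threshold, and the paging hook are
# assumptions, not DigitalOcean's real tooling.
WATCHED_IFACES = ["eth2", "eth3"]   # assumed backend storage uplinks
ERROR_DELTA_THRESHOLD = 100         # assumed errors-per-interval alert threshold
POLL_SECONDS = 30

def read_errors(iface: str) -> int:
    """Sum the kernel's receive and transmit error counters for an interface."""
    stats = Path("/sys/class/net") / iface / "statistics"
    return int((stats / "rx_errors").read_text()) + int((stats / "tx_errors").read_text())

def carrier_up(iface: str) -> bool:
    """True if the interface reports its link (carrier) as up."""
    try:
        return (Path("/sys/class/net") / iface / "carrier").read_text().strip() == "1"
    except OSError:
        # Reading carrier on an administratively-down interface fails; treat as down.
        return False

def page_oncall(message: str) -> None:
    """Stand-in for a real paging integration (hypothetical)."""
    print(f"ALERT: {message}")

def main() -> None:
    # Seed the error baseline, then watch for link loss or rising error counts.
    last = {iface: read_errors(iface) for iface in WATCHED_IFACES}
    while True:
        time.sleep(POLL_SECONDS)
        for iface in WATCHED_IFACES:
            if not carrier_up(iface):
                page_oncall(f"{iface}: link down, possible NIC lockup")
                continue
            errors = read_errors(iface)
            if errors - last[iface] > ERROR_DELTA_THRESHOLD:
                page_oncall(f"{iface}: {errors - last[iface]} errors in {POLL_SECONDS}s")
            last[iface] = errors

if __name__ == "__main__":
    main()
```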

Future Measures

We have performed a full audit of all systems to ensure there are no other hardware misconfigurations. Along with the aforementioned additional alerting, we are enhancing internal processes and adding automated tooling to periodically examine the cluster configuration (sketched below). Lastly, we are working with the vendor to deploy a fix for the network hardware failure that triggered the event.
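
The audit tooling itself is not described in the postmortem. As a hedged illustration only, a periodic audit could compare a per-node snapshot of hardware and network settings against a known-good baseline and flag any drift, such as the misconfiguration that capped recovery at 60%. The file paths and the shape of the config records below are assumptions for the sketch; real tooling would gather live state from each node rather than read pre-collected snapshots:

```python
import json
from pathlib import Path

# Hypothetical sketch: the baseline file, the snapshot directory, and the
# config record format are assumptions, not DigitalOcean's real tooling.
BASELINE_PATH = Path("/etc/spaces/hardware_baseline.json")  # expected settings
SNAPSHOT_DIR = Path("/var/lib/spaces/node_configs")         # one JSON file per node

def audit() -> list[str]:
    """Compare every node snapshot against the baseline; return drift findings."""
    baseline = json.loads(BASELINE_PATH.read_text())
    findings = []
    for snapshot in sorted(SNAPSHOT_DIR.glob("*.json")):
        node = snapshot.stem
        actual = json.loads(snapshot.read_text())
        for setting, expected in baseline.items():
            observed = actual.get(setting)
            if observed != expected:
                findings.append(f"{node}: {setting} is {observed!r}, expected {expected!r}")
    return findings

if __name__ == "__main__":
    drift = audit()
    for finding in drift:
        print(finding)
    # A real periodic job would page or file a ticket on any findings.
    raise SystemExit(1 if drift else 0)
```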

In Conclusion

We take the stability of our services seriously and are committed to improving in every area we can. We apologize for the inconvenience caused by this outage.

May 17, 2019 - 19:49 UTC

Resolved
Our engineering team has resolved the issue with degraded Spaces and CDN performance in our NYC3 region. If you continue to experience issues, please open a ticket with our support team. We apologize for any inconvenience.
May 01, 2019 - 19:00 UTC
Monitoring
Our engineering team has implemented a fix to resolve the issue with degraded Spaces and CDN performance and is monitoring the situation. We will post an update as soon as the issue is fully resolved.
May 01, 2019 - 14:56 UTC
Update
Our engineering team continues to investigate the issue impacting Spaces performance in our NYC3 region. At this time, customers may experience intermittent availability issues with the API and objects in NYC3 Spaces, as well as issues with the Spaces CDN for objects in NYC3. We apologize for the inconvenience and will share another update as soon as possible.
May 01, 2019 - 14:25 UTC
Investigating
Our engineering team is investigating an issue impacting Spaces performance in our NYC3 region. During this time, customers may experience intermittent availability issues with the API and objects in NYC3 Spaces. We will share additional updates as soon as we have more information.
May 01, 2019 - 12:21 UTC
This incident affected: Regions (NYC3) and Services (Spaces).