Multiple products unreachable in NYC1
Incident Report for DigitalOcean
Postmortem

The Incident

On June 7, 2021, our Network Engineering team was performing work to build out new racks in our NYC1 data center. As part of this process, it was necessary to wipe out the configuration (“zeroize”) on the new network devices in order to allow them to reprovision themselves to a newer software version. This had to be done manually by our Network Engineering team, whereas new network devices are normally provisioned via automation. During this process, an error was made that resulted in the wrong zone being targeted, wiping out the configurations on two production racks.

This resulted in a complete outage of all services hosted on those two racks from 11:12 UTC to 11:47 UTC, a total of 35 minutes.

Timeline of Events

11:12 UTC - Top-of-rack switch configurations are erased due to human error

11:14 UTC - Engineering recognizes the mistake and alerts internal teams to begin remediation

11:40 UTC - One of the two switches is recovered and services start to come back online

11:47 UTC - Both switches recovered and all services confirmed back online 

Future Measures

The Network Engineering team will extend existing automation with new functions to perform "zeroize" operations, so that manual connections to network devices will no longer be needed. To guard against human error, the tooling will include two important verifications:

  • Ensure that targeted devices are in either "Staged" status (which is used for new/in-progress builds) or "Decommissioning" status. All devices in production will be in "Active" status, so the tooling will refuse to zeroize them.
  • When redundant devices are targeted at the same time, our automation will ask for a double-check and a confirmation, regardless of the devices’ status.
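
As an illustration only, the two verifications above could be sketched roughly as follows. This is not DigitalOcean's actual tooling: the Device class, its redundancy_group field, and the check_zeroize function are hypothetical names introduced for this example. Only the status values "Staged", "Decommissioning", and "Active" come from the report.

```python
from dataclasses import dataclass
from typing import List, Optional

# Statuses named in the report; only these two permit a zeroize.
ZEROIZE_ALLOWED_STATUSES = {"Staged", "Decommissioning"}


@dataclass(frozen=True)
class Device:
    # Hypothetical inventory record, assumed for this sketch.
    name: str
    status: str  # "Staged", "Decommissioning", or "Active"
    redundancy_group: Optional[str] = None  # shared by devices that back each other up


def check_zeroize(devices: List[Device], confirmed: bool = False) -> List[str]:
    """Return the reasons to refuse the request; an empty list means it may proceed."""
    problems = []

    # Verification 1: refuse any device that is not Staged or Decommissioning.
    for device in devices:
        if device.status not in ZEROIZE_ALLOWED_STATUSES:
            problems.append(f"{device.name}: status {device.status!r} cannot be zeroized")

    # Verification 2: redundant devices targeted together require an explicit
    # confirmation, regardless of their status.
    groups = {}
    for device in devices:
        if device.redundancy_group is not None:
            groups.setdefault(device.redundancy_group, []).append(device.name)
    for group, names in sorted(groups.items()):
        if len(names) > 1 and not confirmed:
            problems.append(
                f"{group}: redundant devices {', '.join(names)} targeted together; "
                "confirmation required"
            )

    return problems
```

In this sketch, a request that names an "Active" switch is refused even when confirmed is set, while a request for both "Staged" switches of one redundant pair is held back only until the operator confirms it.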

In Conclusion

We strive to keep DigitalOcean services available and reliable for our customers who run their businesses and projects on our platform. We apologize for the inconvenience this outage caused and will continue to use the lessons learned to drive improvements in both our systems and processes.

Posted Jun 12, 2021 - 00:04 UTC

Resolved
Our Engineering team has resolved the issues impacting Droplets and Droplet-backed products in our NYC1 region, and all services should now be functioning normally. Thank you for your patience and understanding throughout this process. Please open a ticket with our Support team if you encounter any further issues at all.
Posted Jun 07, 2021 - 12:27 UTC
Update
Our Engineering team has implemented a fix for the issue with Droplets and dependent products in NYC1. We're now monitoring the situation and will post an update as soon as the issue is fully resolved.
Posted Jun 07, 2021 - 11:51 UTC
Identified
Our Engineering team has identified an issue in NYC1. During this time, Droplets and dependent products such as Load Balancers, Managed Databases, and Managed Kubernetes might be unreachable. We apologize for the inconvenience and will share an update once we have more information.
Posted Jun 07, 2021 - 11:33 UTC
This incident affected: Regions (NYC1) and Services (Droplets, Kubernetes, Load Balancers, Managed Databases).