On June 7, 2021, our Network Engineering team was building out new racks in our NYC1 data center. As part of this process, the team needed to wipe the configuration (“zeroize”) of the new network devices so that they could reprovision themselves with a newer software version. This step had to be done manually, whereas new network devices are normally provisioned via automation. During this process, an error caused the wrong zone to be targeted, which wiped the configurations on two production racks.
This resulted in a complete outage of all services hosted on those two racks from 11:12 UTC to 11:47 UTC.
11:12 UTC - Top-of-rack switch configurations are erased due to human error
11:14 UTC - Engineering recognizes the mistake and alerts internal teams to begin remediation
11:40 UTC - One of the two switches is recovered and services start to come back online
11:47 UTC - Both switches recovered and all services confirmed back online
The Network Engineering team will extend existing automation with new functions to perform "zeroize" operations, so that manual connections to network devices will no longer be needed. In order to avoid human error, tooling will include two important verifications:
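As an illustration only, a pre-flight guard of this kind might look like the following Python sketch. The two checks shown (a zone match and a production flag) and all names (`Device`, `ZeroizeRefused`, `check_zeroize`, `zeroize`) are assumptions made for this example; they are not DigitalOcean's actual tooling or its actual verifications.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Device:
    """Hypothetical inventory record for a network device."""
    hostname: str
    zone: str
    in_production: bool


class ZeroizeRefused(Exception):
    """Raised when a safety check blocks a zeroize request."""


def check_zeroize(device: Device, expected_zone: str) -> None:
    """Raise ZeroizeRefused unless the device passes both illustrative checks."""
    # Assumed check 1: the device must be in the zone the operator intended.
    if device.zone != expected_zone:
        raise ZeroizeRefused(
            f"{device.hostname} is in zone {device.zone!r}, "
            f"expected {expected_zone!r}"
        )
    # Assumed check 2: a device marked as serving production is never wiped.
    if device.in_production:
        raise ZeroizeRefused(f"{device.hostname} is marked as in production")


def zeroize(device: Device, expected_zone: str,
            wipe: Callable[[Device], None]) -> None:
    """Run the checks first, then call the supplied wipe action."""
    check_zeroize(device, expected_zone)
    wipe(device)
```

In this sketch the wipe action is passed in as a callable, so the checks can be exercised without touching a real device, and a request that fails either check never reaches the wipe step.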
We strive to keep DigitalOcean services available and reliable for our customers who run their businesses and projects on our platform. We apologize for the inconvenience this outage caused and will continue to use the lessons learned to drive improvements in both our systems and processes.