On January 28, 2021 at 12:10 UTC, an issue occurred in DigitalOcean’s LON1 data center, which impacted connectivity for the servers that provide DNS resolution services within this data center.
During planned maintenance to upgrade the operating system on a subset of top-of-rack network switches in the LON1 data center, the network team isolated the switches to be upgraded, as the normal maintenance procedure requires. Each of these switches is paired with a redundant switch, which provides highly available connectivity to the servers in those racks.
During the isolation process, alerts indicated that the servers that provide DNS resolution services within LON1 had become unavailable. These servers are connected to the switches that were being upgraded, but the loss of connectivity was unexpected: connectivity should have been maintained through the redundant switches, which were not being upgraded as part of this maintenance.
The network engineer performing the maintenance immediately engaged our Cloud Operations team to troubleshoot the issue. To mitigate the impact on services, the team began a rollback at 12:14 UTC and the maintenance was scrubbed. DNS services were partially restored at 12:16 UTC and fully restored at 12:19 UTC.
Although physical connectivity between the servers and each of the redundant switches was verified before the maintenance, a misconfiguration on the servers prevented them from using the redundant switch, so they lost connectivity as soon as the primary switches were isolated.
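The gap between a physically connected link and a link the server can actually use can be checked in software before a maintenance begins. The report does not say what the misconfiguration was, so the following is only a minimal sketch under one assumption: that the servers use the Linux bonding driver, whose status text is exposed at `/proc/net/bonding/bond0`. The function names and the sample status text are hypothetical and are not DigitalOcean's tooling.

```python
def active_bond_members(bonding_status: str) -> list:
    """Return interfaces that are enrolled in the bond AND report link up.

    `bonding_status` is text in the format of /proc/net/bonding/<bond>
    (assumption: the servers use the Linux bonding driver).
    """
    members = []
    current = None
    for raw in bonding_status.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current is not None:
            # Only the first MII Status line after a "Slave Interface" line
            # belongs to that member; the bond-level one is skipped because
            # `current` is still None at that point.
            if line.split(":", 1)[1].strip() == "up":
                members.append(current)
            current = None
    return members


def is_redundant(bonding_status: str, required: int = 2) -> bool:
    """True if at least `required` enrolled members have link up."""
    return len(active_bond_members(bonding_status)) >= required


# Hypothetical status text: both uplinks enrolled in the bond.
SAMPLE_OK = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: up
"""

# Hypothetical status text: the second uplink is cabled but was never
# added to the bond, so the server cannot fail over to it.
SAMPLE_SINGLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up
"""

if __name__ == "__main__":
    print(is_redundant(SAMPLE_OK))
    print(is_redundant(SAMPLE_SINGLE))
```

On a real host the input would come from reading the bond's status file. A check like this flags a server whose second uplink is cabled but not usable, which a cable or link-light check alone would miss.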
As a part of the analysis for this issue, DigitalOcean has identified a number of corrective actions to ensure that similar issues do not recur: