DNS Issues in LON1 Region
Incident Report for DigitalOcean
Postmortem

On January 28, 2021 at 12:10 UTC, an issue occurred in DigitalOcean’s LON1 data center, which impacted connectivity for the servers that provide DNS resolution services within this data center.

During a planned maintenance to upgrade the operating system on a subset of top-of-rack network switches in the LON1 data center, the network team isolated the switches to be acted upon as part of the normal maintenance procedures. These switches are paired with a redundant set, which provides for highly available connectivity to the servers in those racks.

During the isolation process, alerts indicated that servers which provide DNS resolution services within LON1 became unavailable. These servers are connected to the switches that were being acted upon, but the connectivity impact was unexpected as connectivity should have been provided via the redundant switches (which were not being upgraded as a part of this maintenance).

The network engineer performing the maintenance immediately engaged with our Cloud Operations team to troubleshoot the issue. To mitigate the impact to services, a rollback was begun at 12:14 UTC and the maintenance was aborted. At 12:16 UTC, DNS services were partially restored, and at 12:19 UTC they were fully restored.

While physical connectivity between the servers and each of the redundant switches was verified before the maintenance, a misconfiguration on the servers prevented them from utilizing the redundant switch.

As a part of the analysis for this issue, DigitalOcean has identified a number of corrective actions to ensure that similar issues do not recur:

  • Correct the configuration on the affected DNS servers and audit other servers to ensure they have valid configurations.
  • Enhance existing automation to verify reachability (in addition to physical connectivity) for servers via both switches, and update maintenance procedures to perform this verification prior to the isolation step.
  • Update maintenance procedures to reduce risk by avoiding actions upon devices in multiple critical infrastructure racks simultaneously.
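
The second corrective action, verifying reachability rather than physical link state alone, can be sketched as a pre-isolation gate. The Python below is a minimal illustration under stated assumptions and is not DigitalOcean's automation: the `PathCheck` record, the `servers_blocking_isolation` function, and the host and switch names are all hypothetical, and the per-path probe results are assumed to have been collected beforehand by some other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathCheck:
    """One probe result for a server via one switch (hypothetical record)."""
    server: str
    switch: str
    link_up: bool    # physical connectivity to the switch
    reachable: bool  # server actually answers traffic over this path


def servers_blocking_isolation(checks, switches_to_isolate):
    """Return the servers that would be left without a working path
    if the given switches were isolated."""
    isolating = set(switches_to_isolate)
    all_servers = {c.server for c in checks}
    # A server survives only if some path through a remaining switch
    # has both link and verified reachability.
    survivors = {
        c.server
        for c in checks
        if c.switch not in isolating and c.link_up and c.reachable
    }
    return sorted(all_servers - survivors)


# Hypothetical sample data: dns-1 has link on both switches but only
# passes traffic through sw-a.
checks = [
    PathCheck("dns-1", "sw-a", link_up=True, reachable=True),
    PathCheck("dns-1", "sw-b", link_up=True, reachable=False),
    PathCheck("dns-2", "sw-a", link_up=True, reachable=True),
    PathCheck("dns-2", "sw-b", link_up=True, reachable=True),
]

blocked = servers_blocking_isolation(checks, ["sw-a"])
print(blocked)  # ['dns-1']
```

In this sample, dns-1 would pass a check based on link state alone, which mirrors the failure described above; the reachability check flags it before the isolation step, and the maintenance would proceed only when the returned list is empty.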
Posted Feb 08, 2021 - 20:14 UTC

Resolved
Between 12:02 and 12:16 UTC, our engineering team observed issues impacting DNS resolution in our LON1 region. During this time, some users may have experienced errors when interacting with resources in LON1 (Droplets, Managed Kubernetes, Managed Databases), as well as when resolving DNS names from within these resources. Our team was able to take quick action to mitigate the impact and resolve the issue, and all services in the LON1 region are now functioning normally. Thank you for your patience, and we apologize for any inconvenience. If you are still experiencing issues or have additional questions, please open a Support ticket right away.
Posted Jan 28, 2021 - 05:00 UTC