Shortly after midnight Eastern time on Wednesday, June 26th, at 12:01 AM, planned maintenance commenced on the core switches in our NYC1 datacenter to replace and add hardware that would provide additional network ports. This work was conducted as part of our ongoing efforts to maintain sufficient capacity. During this work, we determined that one of the newly installed cards had a hardware fault. In order to proceed with the maintenance, some existing connections were moved to different network hardware, which reduced the number of links between the two switches (which are used to maintain high-availability state and other network information) from four to two. This is still a supported configuration; however, it provides reduced redundancy. We opened a case with the vendor and expected replacement hardware within 24 hours. The planned maintenance activity completed without impact.
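For illustration, here is a minimal sketch of the kind of redundancy check that can flag this condition automatically. The link names, counts, and thresholds are hypothetical, not our actual monitoring configuration:

```python
# Hypothetical sketch: alert when the number of healthy inter-switch
# links drops below full redundancy. Link names and thresholds are
# illustrative, not our real monitoring configuration.

EXPECTED_LINKS = 4   # links normally bundled between the core switches
MINIMUM_SAFE = 2     # still supported, but with reduced redundancy

def check_inter_switch_links(link_states: dict[str, bool]) -> str:
    """Classify the redundancy level of the inter-switch link bundle."""
    healthy = sum(1 for up in link_states.values() if up)
    if healthy >= EXPECTED_LINKS:
        return "ok: full redundancy"
    if healthy >= MINIMUM_SAFE:
        return f"warning: {healthy}/{EXPECTED_LINKS} links up, reduced redundancy"
    return f"critical: {healthy}/{EXPECTED_LINKS} links up, redundancy lost"

# Example: two of four links taken down by a faulty card
print(check_inter_switch_links(
    {"et-0/0/1": True, "et-0/0/2": True, "et-1/0/1": False, "et-1/0/2": False}
))
```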
Later that morning, Wednesday, June 26th, at approximately 7:20 AM Eastern time, we began to experience packet loss in our NYC1 datacenter. Our Engineering team began investigating and identified a bug in the firmware version running on the switches, one whose effects were amplified by the reduced-redundancy configuration in place. Engineers restarted the portions of the switch hardware that were reporting errors, which resolved most of the packet loss; however, some residual impact remained (some destinations were still reporting problems) and engineers continued to see errors in the logs. To fully mitigate the impact, we upgraded the firmware to a version that included fixes for this bug. Upgrades proceeded one switch at a time, with a momentary loss of connectivity as each device of the pair came back online. Once the upgrade of the second switch was complete and the device had fully booted, impacts from the event were fully mitigated.
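As a rough sketch of the one-at-a-time upgrade sequence described above: upgrade one device, wait for it to boot and for routing to converge, then move to the next. The helper functions here are stand-ins, not our actual automation:

```python
import time

# Hypothetical sketch of a rolling firmware upgrade across a redundant
# switch pair. upgrade_firmware and routing_converged are stand-ins for
# real tooling and health checks, not our production automation.

def upgrade_firmware(switch: str, version: str) -> None:
    print(f"{switch}: installing firmware {version} and rebooting")

def routing_converged(switch: str) -> bool:
    # Stand-in for a real health check (protocol state, error counters).
    return True

def rolling_upgrade(switches: list[str], version: str, timeout_s: int = 3600) -> None:
    for switch in switches:
        upgrade_firmware(switch, version)
        deadline = time.monotonic() + timeout_s
        while not routing_converged(switch):
            if time.monotonic() > deadline:
                raise RuntimeError(f"{switch}: did not converge; halt and investigate")
            time.sleep(30)
        print(f"{switch}: fully booted, routing converged; proceeding")

rolling_upgrade(["core-sw1", "core-sw2"], "fixed-release")
```

Upgrading one member of the pair at a time keeps the other in service, which is why impact during each upgrade was limited to a momentary loss of connectivity.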
The faulty hardware was replaced the next day using in-service procedures without impact.
During the primary downtime event, connectivity between Droplets, connectivity between Droplets and the Internet, and Cloud Control Panel and API functionality were intermittently impacted, with some periods of complete unreachability.
11:20 UTC - Our team is initially alerted to the issue
11:30 UTC - Engineering team begins investigation
11:52 UTC - Root cause is identified and troubleshooting efforts begin
12:12-13:45 UTC - Initial mitigation attempts are made, with incomplete results
14:10 UTC - Upgrade begins on first network device
14:48 UTC - First network device fully booted with new firmware and routing converged
14:56 UTC - Second device upgrade begins
15:22 UTC - Second network device fully booted with new firmware and routing converged
15:23 UTC - Internal alarms begin clearing, impact is over and the incident is resolved
Our Network Engineering team is investigating additional steps and checks that can be added to our existing change plans for work done during maintenance windows. Most notably, we will define what is required for a change to be considered successful versus what would trigger a rollback, so that informed risk decisions are made ahead of time rather than in the midst of the activity.
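One way to make those decisions ahead of time is to codify the success criteria and rollback triggers in the change plan itself. A minimal sketch, with illustrative criteria rather than our actual checklist:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: encode go/no-go criteria in the change plan so the
# rollback decision is made before the maintenance window, not during it.
# The specific criteria below are illustrative.

@dataclass
class ChangePlan:
    name: str
    success_criteria: set[str] = field(default_factory=set)
    rollback_triggers: set[str] = field(default_factory=set)

    def evaluate(self, observations: set[str]) -> str:
        """Decide the outcome of the change from observed conditions."""
        if observations & self.rollback_triggers:
            return "rollback"
        if self.success_criteria <= observations:
            return "success"
        return "hold: criteria not yet met, keep monitoring"

plan = ChangePlan(
    name="core switch line-card replacement",
    success_criteria={"all new cards pass diagnostics", "four inter-switch links up"},
    rollback_triggers={"hardware fault on new card"},
)
print(plan.evaluate({"hardware fault on new card"}))  # -> "rollback"
```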
We recognize the impact these networking issues had on our customers, and we sincerely apologize for the frustration and inconvenience.