Networking Issues in NYC1 Affecting Multiple Services
Incident Report for DigitalOcean
Postmortem

The Incident

At 12:01 AM Eastern time on Wednesday, June 26th, planned maintenance operations commenced on the core switches in our NYC1 datacenter to replace and add hardware that would provide additional network ports. This work was conducted as part of our ongoing efforts to maintain sufficient capacity. During this work, we determined that one of the newly installed cards had a hardware fault. In order to proceed with the maintenance, some existing connections were moved to different network hardware, which reduced the number of connections between the two switches - connections used to maintain high-availability state and other network information - from four to two. This is still a supported configuration; however, it provides reduced redundancy. We opened a case with the vendor and expected replacement hardware within 24 hours. The planned maintenance activity completed without impact.
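
For illustration only - this is not our production tooling, and the names and thresholds below are hypothetical assumptions - a redundancy check of the kind implied above might look like the following sketch, which classifies the state of the inter-switch link group during a maintenance step.

```python
# Hypothetical sketch: classify inter-switch link redundancy during maintenance.
# MINIMUM_LINKS and SUPPORTED_FLOOR are illustrative values, not vendor limits.

MINIMUM_LINKS = 4    # links normally present between the core switch pair
SUPPORTED_FLOOR = 2  # still a supported configuration, but reduced redundancy


def assess_redundancy(active_links: int) -> str:
    """Classify the inter-switch link group based on the number of active links."""
    if active_links >= MINIMUM_LINKS:
        return "nominal"
    if active_links >= SUPPORTED_FLOOR:
        # Supported, but any further failure risks impact; track vendor replacement.
        return "degraded: reduced redundancy, expedite hardware replacement"
    return "critical: below supported minimum, roll back or escalate"


if __name__ == "__main__":
    # After the faulty card was removed from service, four links dropped to two.
    print(assess_redundancy(2))
```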

Later that morning, at approximately 7:20 AM Eastern time on Wednesday, June 26th, we began to experience packet loss in our NYC1 datacenter. Our Engineering team began investigating and identified a bug in the version of firmware running on the switches; the bug's effects were more visible because of the reduced-redundancy configuration in place. Engineers attempted to restart the portions of the switch hardware that were reporting errors, which resolved most of the packet loss; however, some residual impact remained (some destinations reporting problems) and engineers continued to see errors in the logs. To mitigate the impact, we upgraded the firmware to a version that included fixes for this bug. Upgrades proceeded one switch at a time, with a momentary loss of connectivity while each device of the pair came back online. Once the second switch was upgraded and fully booted, impacts from the event were fully mitigated.
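
As a rough sketch of the one-at-a-time upgrade sequence described above - the helper functions are assumed stand-ins for real device automation, not an actual vendor API:

```python
# Minimal sketch of a rolling firmware upgrade across a redundant switch pair:
# upgrade one device, wait until it is fully booted and routing has converged,
# and only then move to its peer. All functions and device names are hypothetical.

import time


def start_firmware_upgrade(switch: str) -> None:
    print(f"starting firmware upgrade on {switch}")


def is_booted_and_converged(switch: str) -> bool:
    # Placeholder: a real check would poll the device for boot completion
    # and verify routing adjacencies and forwarding state.
    return True


def upgrade_pair(switches: list[str], poll_seconds: int = 30) -> None:
    for switch in switches:
        start_firmware_upgrade(switch)
        while not is_booted_and_converged(switch):
            time.sleep(poll_seconds)
        print(f"{switch} upgraded; routing converged, proceeding to next device")


if __name__ == "__main__":
    upgrade_pair(["core-switch-1", "core-switch-2"])
```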

The faulty hardware was replaced the next day using in-service procedures without impact.

During the primary downtime event, connectivity between Droplets, connectivity between Droplets and the Internet, and Cloud Control Panel and API functionality were intermittently impacted, with some periods of complete unreachability.

Timeline of Events

11:20 UTC - Our team is initially alerted to the issue

11:30 UTC - Engineering team begins investigation

11:52 UTC - Root cause is identified and troubleshooting efforts begin

12:12-13:45 UTC - Initial mitigation attempts are made, with incomplete results

14:10 UTC - Upgrade begins on first network device

14:48 UTC - First network device fully booted with new firmware and routing converged

14:56 UTC - Second device upgrade begins

15:22 UTC - Second network device fully booted with new firmware and routing converged

15:23 UTC - Internal alarms begin clearing, impact is over and the incident is resolved

Future Measures

Our Network Engineering team is investigating additional steps and checks that can be included in our existing change plans for work performed during maintenance windows. Most notably, we will look at defining the requirements for considering a change successful versus the conditions that would require a rollback, so that informed risk decisions are made ahead of time rather than in the midst of the activity.
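
As a simplified sketch of what codifying such requirements could look like - the field names and thresholds here are purely illustrative assumptions, not our actual change-plan criteria:

```python
# Hypothetical sketch: success and rollback criteria agreed before a maintenance
# window, evaluated by an explicit post-change check rather than ad-hoc judgment.

from dataclasses import dataclass


@dataclass
class PostChangeChecks:
    packet_loss_pct: float   # measured loss across the affected fabric
    inter_switch_links: int  # active links between the core switch pair
    new_log_errors: int      # errors logged since the change completed


def change_decision(checks: PostChangeChecks) -> str:
    # Illustrative thresholds only; real values would come from the change plan.
    if checks.packet_loss_pct > 0.5 or checks.inter_switch_links < 2:
        return "rollback"
    if checks.new_log_errors > 0 or checks.inter_switch_links < 4:
        return "proceed with elevated risk: schedule follow-up remediation"
    return "success"


if __name__ == "__main__":
    print(change_decision(PostChangeChecks(packet_loss_pct=0.0,
                                           inter_switch_links=2,
                                           new_log_errors=0)))
```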

In Conclusion

We recognize the impact these networking issues had on our customers, and we sincerely apologize for the frustration and inconvenience.

Posted Jul 09, 2019 - 17:43 UTC

Resolved
Our engineering team has resolved the issues resulting from the faulty network card, and the impact has ended. All services should now be operating normally. If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Posted Jun 26, 2019 - 17:04 UTC
Monitoring
Our engineering team has now completed the router code upgrade to help restore stability and resolve the issues caused by the faulty network card. We appreciate your patience and will post an update as soon as the issue is fully resolved.
Posted Jun 26, 2019 - 15:50 UTC
Update
Our engineering team has identified and remediated the immediate issues, which were discovered to be related to a faulty network card. However, we continue to see alerts indicating intermittent stability issues. Our engineering team is now executing a router code upgrade that we believe will address the issues and allow us to fully restore service. During this upgrade, users may experience brief connectivity issues and interrupted traffic flows. We apologize for the inconvenience and will post another update once the upgrade has completed.
Posted Jun 26, 2019 - 14:22 UTC
Identified
Our engineering team has identified the cause of the networking issue in our NYC1 region that is currently causing connectivity problems and impacting multiple services, including our Cloud Control Panel and API, and is working on a fix. We will post an update as additional information becomes available.
Posted Jun 26, 2019 - 13:31 UTC
Update
Our engineering team continues to investigate the ongoing networking issue in our NYC1 region. During this time, you may experience intermittent packet loss or increased latency, as well as issues with creating and accessing services and accounts from the Cloud Control Panel or API. We apologize for the inconvenience and will share an update once we have more information.
Posted Jun 26, 2019 - 12:11 UTC
Investigating
Our engineering team is investigating a networking issue in our NYC1 region. During this time, you may experience intermittent packet loss or increased latency. We apologize for the inconvenience and will share an update once we have more information.
Posted Jun 26, 2019 - 11:48 UTC
This incident affected: Regions (Global, NYC1) and Services (API, Cloud Control Panel, Networking).