NYC3 Network Connectivity and Event Processing Delays
Incident Report for DigitalOcean

The Incident

At 15:39 UTC on 2017-07-13, a portion of the NYC3 network experienced an outage that resulted in packet loss, increased latency, and connectivity issues for Droplets within the region. The outage was caused by a configuration error introduced while the network engineering team was adding more compute capacity.

The infrastructure team at DigitalOcean regularly adds compute capacity to regions to support our continued growth. This work happens frequently and usually causes no disruption. One of the first steps in adding more capacity is building out the network infrastructure required to support the new servers. When we do this, we use various automation tools to make the process repeatable and scalable.

We had recently made a minor update to one of our automation tools to further reduce the amount of manual work required to deploy new switches. The change introduced a bug that applied duplicate IP addresses to the interfaces which link switches together. Because of the address conflict, a protocol known as the Inter-Chassis Control Protocol (ICCP) failed to establish. ICCP is required for multi-chassis link aggregation to work correctly. When the new pod was joined to the network, the missing ICCP session allowed a layer 2 loop to form, and that loop caused the connectivity issues within a specific zone in NYC3.
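
This class of bug is straightforward to screen for mechanically. The sketch below, written in Python purely for illustration, shows the kind of pre-deployment guard that would catch it; the function name, data layout, and example addresses are hypothetical rather than taken from our actual tooling. The idea is simply to refuse a deployment if any two peer-link interfaces in the generated configurations share an IP address.

    # A minimal sketch (not our actual tooling) of a guard that refuses to
    # deploy generated switch configs when peer-link IP addresses collide.
    from collections import defaultdict

    def find_duplicate_peer_ips(configs):
        """configs: mapping of switch name -> {interface: ip} for peer-link interfaces."""
        seen = defaultdict(list)
        for switch, interfaces in configs.items():
            for interface, ip in interfaces.items():
                seen[ip].append((switch, interface))
        # Any address assigned to more than one interface is a deployment blocker.
        return {ip: where for ip, where in seen.items() if len(where) > 1}

    # Hypothetical example: two switches generated with the same peer-link address.
    generated = {
        "switch-a": {"ae0.0": "169.254.0.1"},
        "switch-b": {"ae0.0": "169.254.0.1"},  # duplicate -> ICCP cannot establish
    }

    duplicates = find_duplicate_peer_ips(generated)
    if duplicates:
        raise SystemExit(f"Refusing to deploy; duplicate peer-link IPs: {duplicates}")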

Timeline of Events

15:39 UTC: Configuration applied to bring new pod into production

15:45 UTC: First internal alert received indicating connectivity issues in NYC3; the networking team begins investigating

15:54 UTC: Alerting tools begin paging engineering team members

15:57 UTC: Discussion around incident begins and incident manager is paged

16:06 UTC: Error in new network devices discovered and change is rolled back

16:11 UTC: Alerts begin to clear

16:36 UTC: Status page marked as resolved

Future Measures

We have processes in place which check the status of ICCP (among other things) before adding devices to the network; a sketch of that kind of check follows the list below. Ultimately, two factors led to this outage:

  • A bug was introduced into our automation tool that caused an invalid configuration to be loaded onto multiple devices

  • We did not follow our verification process to ensure new devices were error-free before adding them to production
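
For illustration, here is a minimal sketch of what such a verification gate can look like. It assumes a hypothetical run_command() helper that executes a CLI command on a device and returns its output as text; the command name and output format vary by vendor, so this shows the shape of the check rather than our exact implementation.

    # A minimal sketch of a pre-production gate: every new device must report an
    # established ICCP session before the pod is allowed to join the network.
    # run_command is a hypothetical helper supplied by the caller.
    def verify_iccp_established(device, run_command):
        output = run_command(device, "show iccp")
        # The exact output format is vendor-specific; here we only look for a
        # session that reports itself as established.
        if "established" not in output.lower():
            raise RuntimeError(f"{device}: ICCP is not established; aborting turn-up")

    def preflight(devices, run_command):
        for device in devices:
            verify_iccp_established(device, run_command)
        print("Preflight passed: ICCP established on all new devices")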

To prevent this type of issue from happening again we are strengthening our process by:

  • Reviewing and testing our automation tools to check for any further bugs

  • Expanding our verification process to include a peer review before adding new devices to the production network

  • Ensuring that any changes to our config automation tools are tested in our lab before being used on production devices

In Conclusion

We want to apologize for the experience you had during this incident. We take any form of negative impact on your service seriously. Our goal is to constantly improve, and we hope that the transparency and detail we have shared here reflect that.

Jul 14, 2017 - 18:45 UTC

Resolved
We've resolved the issue causing delays in event processing and errors using the control panel. If you still notice any problems, please open a ticket for our support team.
Jul 13, 2017 - 16:36 UTC
Identified
We've deployed a fix for the event processing issue - new events that are submitted should complete normally, and we're continuing to investigate events that haven't finished processing yet. Control panel usage and NYC3 network connectivity should be stable at this time.
Jul 13, 2017 - 16:24 UTC
Investigating
We're investigating network connectivity issues in the NYC3 region. During this time, you may see packet loss or inability to connect to your Droplets there, delays in event processing for those Droplets, or errors managing Cloud Firewalls or Load Balancers.
Jul 13, 2017 - 16:09 UTC
This incident affected: Services (Cloud Control Panel, Cloud Firewall, Load Balancers, Networking) and Regions (NYC3).