At 15:39 UTC on 2017-07-13, a portion of the NYC3 network experienced an outage which resulted in packet loss, latency, and connectivity issues for Droplets within the region. It was caused by a configuration error introduced while the network engineering team was adding more compute capacity.
The infrastructure team at DigitalOcean constantly adds compute capacity to regions to support our continued growth. This happens regularly, usually with no disruption. One of the first steps in adding capacity is building out the network infrastructure required to support the new servers. We use various automation tools to make this process repeatable and scalable.
We had recently made a minor update to one of our automation tools to further cut down on the amount of manual work required to deploy new switches. The change introduced a bug that applied duplicate IP addresses to the interfaces which link switches together. As a result, a protocol known as Inter-Chassis Control Protocol (ICCP), which is required for multi-chassis link aggregation to work correctly, failed to establish. When the new pod was joined into the network, the absence of ICCP allowed a layer 2 loop to form. This loop was the cause of the connectivity issues within a specific zone in NYC3.

Timeline of Events
15:39 UTC: Configuration applied to bring new pod into production
15:45 UTC: First internal alert received indicating connectivity issues in NYC3; networking team begins investigating
15:54 UTC: Alerting tools begin paging engineering team members
15:57 UTC: Discussion around incident begins and incident manager is paged
16:06 UTC: Error in new network devices discovered and change is rolled back
16:11 UTC: Alerts begin to clear
16:36 UTC: Status page marked as resolved
We have processes in place which check the status of ICCP (among other things) before adding devices to the network. Ultimately, two factors led to this outage:
A bug was introduced to our automation tool which caused an invalid configuration to be loaded onto multiple devices
We did not follow our verification process to ensure new devices were error-free before adding them to production
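To illustrate the first factor, a pre-deployment check can scan generated configurations for interface IPs that collide across devices. This is a minimal sketch with hypothetical device names and data structures, not our actual tooling:

```python
from collections import defaultdict

def find_duplicate_ips(configs):
    """Map each interface IP to the (device, interface) pairs using it,
    and return only the IPs claimed by more than one interface."""
    seen = defaultdict(list)
    for device, interfaces in configs.items():
        for ifname, ip in interfaces.items():
            seen[ip].append((device, ifname))
    return {ip: users for ip, users in seen.items() if len(users) > 1}

# Hypothetical generated configs: both switches were assigned the same
# address on their inter-chassis link, so ICCP could never establish.
configs = {
    "switch-a": {"et-0/0/48": "10.0.0.1/31"},
    "switch-b": {"et-0/0/48": "10.0.0.1/31"},  # bug: should be 10.0.0.0/31
}
print(find_duplicate_ips(configs))
```

A check like this catches the duplicate before any configuration reaches a device, rather than after ICCP fails to come up.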
To prevent this type of issue from happening again we are strengthening our process by:
Reviewing and testing our automation tools to check for any further bugs
Expanding our verification process to include a peer review before adding new devices to the production network
Ensuring that any changes to our config automation tools are tested in our lab before being used on production devices
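A verification gate of the kind described above can refuse to bring a pod into production unless every device passes every pre-production check. This is a minimal sketch under assumed data shapes; the check functions are hypothetical stand-ins for tooling that would query real device state (such as ICCP session status) through a vendor API:

```python
def verify_device(state, checks):
    """Run every pre-production check against a device's reported state
    and collect the names of the checks that failed."""
    return [name for name, check in checks.items() if not check(state)]

def join_pod(devices, checks):
    """Bring a pod into production only when every device passes every
    check; otherwise report exactly what failed and abort."""
    failures = {name: verify_device(state, checks)
                for name, state in devices.items()}
    blocked = {name: failed for name, failed in failures.items() if failed}
    if blocked:
        raise RuntimeError(f"refusing to join pod; failed checks: {blocked}")
    return True

# Hypothetical checks; real tooling would query the devices directly.
checks = {
    "iccp_established": lambda state: state.get("iccp") == "established",
    "no_interface_errors": lambda state: state.get("errors", 0) == 0,
}
```

Collecting all failures before aborting, instead of stopping at the first one, gives the operator the full picture of what is wrong with a pod in a single pass.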
We want to apologize for the experience you had today. We take any negative impact on your service seriously. Our goal is to constantly improve, and we hope the transparency and detail we have shared here demonstrate that.