Intermittent NYC2 Connectivity Issues
Incident Report for DigitalOcean

The Incident

At approximately 06:10 UTC on October 3, 2017, we began to see an unexpected increase in load on the core networking gear in NYC2. Network engineers investigated and determined that increased control plane load on core network equipment was the likely cause of the intermittent connectivity issues throughout NYC2. Because the control plane is the CPU-driven portion of a network device that processes routing protocol and management traffic, sustained overload there can disrupt connectivity even when raw forwarding capacity is available.

Our engineering team began troubleshooting to identify the cause of the increased load, its origins, and ways to mitigate it. By Wednesday, all standard troubleshooting techniques and mitigations had been exhausted, and we engaged the hardware vendor for further assistance. On Wednesday evening, we implemented some minor configuration changes recommended by the vendor, but these were ultimately unsuccessful, and diagnosis of the root cause continued.

From Wednesday evening into Thursday, we determined that the networking devices were behaving in ways inconsistent with how the hardware platform should operate under normal conditions, and we decided to perform a software upgrade to a newer release. While the upgrade was not itself a fix for the root problems, it was a required step to bring the devices to a baseline state where their behavior matched how the hardware should operate. It also provided visibility into metrics that were not available in the older software release.

After the software upgrade was completed, efforts to isolate the root problems continued and were ultimately successful. A specific type of network traffic that occurs under normal operating conditions had increased by a significant factor, and the resulting control plane load was triggering the problems. Once this traffic had been identified, changes were made to isolate it, and the root problems were successfully mitigated on Friday.
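
This report does not name the traffic type or the exact configuration changes, but the detection step generalizes: compare current per-class traffic rates against an established baseline and flag any class that has grown by a large factor. The Python sketch below is a minimal, hypothetical illustration of that comparison; the class names, baseline rates, and 10x threshold are assumptions for the example, not details from this incident.

    # Hypothetical sketch: flag traffic classes whose current rate has grown
    # far beyond an established baseline. Class names, baseline rates, and
    # the 10x growth factor are illustrative assumptions, not incident data.

    BASELINE_PPS = {       # long-term average packets per second, per class
        "arp": 500,
        "bgp": 200,
        "icmp": 1_000,
    }

    GROWTH_FACTOR = 10     # what counts as "increased by a significant factor"

    def anomalous_classes(current_pps: dict[str, float]) -> list[str]:
        """Return the traffic classes whose rate exceeds baseline * factor."""
        return [
            cls for cls, rate in current_pps.items()
            if rate > BASELINE_PPS.get(cls, float("inf")) * GROWTH_FACTOR
        ]

    # Example: one class has grown roughly 20x while the others look normal.
    observed = {"arp": 10_000, "bgp": 210, "icmp": 950}
    print(anomalous_classes(observed))  # -> ['arp']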

Timeline of Events

Note: Timeline events are collated into 6-hour intervals.

Oct 3, 2017:
  • 06:10 UTC: Initial problem triggers; an increase in control plane utilization is logged in monitoring systems
  • 12:00 UTC: Support teams have identified intermittent connectivity problems in NYC2; customers have begun reporting intermittent connectivity problems in NYC2; the network engineering team has been engaged to investigate reports
  • 18:00 UTC: Network engineering has identified the increased control plane utilization on some core networking devices; troubleshooting continues based on symptom reports, with no conclusive evidence of the cause yet identified

Oct 4, 2017:
  • 00:00 UTC: Network engineering continues to troubleshoot based on symptom reports; no conclusive evidence of the cause yet identified
  • 06:00 UTC: Network engineering continues to troubleshoot based on symptom reports; no conclusive evidence of the cause yet identified
  • 12:00 UTC: Network engineering has identified inconsistent behavior of certain control plane functions on core networking devices when handling specific types of network traffic; the issue can now be easily reproduced
  • 18:00 UTC: With all normal avenues of troubleshooting exhausted, the network engineering team engages the hardware vendor with a priority 1 technical case

Oct 5, 2017:
  • 00:00 UTC: Hardware vendor recommends configuration changes to try to mitigate the issue; network engineering implements the changes, but the issue persists; troubleshooting with the hardware vendor continues
  • 06:00 UTC: Troubleshooting with the hardware vendor continues
  • 12:00 UTC: Hardware vendor confirms the platform is behaving inconsistently with how it should perform; troubleshooting now focuses on this inconsistent hardware behavior as a key contributor to the issue
  • 18:00 UTC: Hardware vendor and network engineering agree to perform an immediate software release upgrade on core networking devices to address the inconsistent platform behavior; network engineering begins preparations for the upgrade

Oct 6, 2017:
  • 00:00 UTC: Preparations for the software release upgrade continue in parallel with ongoing troubleshooting with the hardware vendor
  • 06:00 UTC: Software release upgrade is performed
  • 12:00 UTC: Troubleshooting continues; certain network traffic is identified as the likely root cause
  • 16:45 UTC: Configuration changes are made to isolate the network traffic contributing to the increased control plane load, resolving the issue

Future Measures

There are a couple of key takeaways from this incident that would have helped reduce its impact and duration:

  • Re-evaluate our processes for tracking the vendor's software release cycle.
  • Continue to increase our monitoring of and visibility into the network. We have already performed additional work to increase the parameters we monitor since this incident began; one illustrative sketch of such a poller follows this list.
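
As one concrete illustration of what increasing monitored parameters can look like, a simple poller can track control plane CPU on each core device and alert on sustained elevation. The sketch below assumes the standard net-snmp snmpget command-line tool is available; the host names, community string, OID, and 80% threshold are placeholders for the example, not our actual tooling or values.

    # Minimal sketch of control plane CPU polling via the net-snmp CLI.
    # Host names, community string, OID, and threshold are placeholders.
    import subprocess
    import time

    CPU_OID = "1.3.6.1.4.1.9.9.109.1.1.1.1.7.1"         # placeholder CPU gauge OID
    HOSTS = ["core1.example.net", "core2.example.net"]  # hypothetical devices
    THRESHOLD = 80                                      # alert above 80% utilization

    def poll_cpu(host: str) -> int:
        """Fetch the CPU gauge from one device (-Oqv prints the value only)."""
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", "public", "-Oqv", host, CPU_OID],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.strip())

    while True:
        for host in HOSTS:
            cpu = poll_cpu(host)
            if cpu > THRESHOLD:
                print(f"ALERT: {host} control plane CPU at {cpu}%")
        time.sleep(60)  # poll once per minute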

In Conclusion

We are very disappointed that this incident lasted as long as it did, and we are sincerely sorry for the inconvenience and frustration it caused our users.

Posted Oct 10, 2017 - 23:05 UTC

Resolved
At this time our engineering team has determined that the fix put in place has resolved the issue. If you are still noticing networking trouble in NYC2, please open a ticket to let us know. We will be publishing a post-mortem once the full investigation process has been completed.
Posted Oct 06, 2017 - 20:56 UTC
Monitoring
Our engineering team has isolated the cause of the networking issues and put a fix in place that we believe will prevent these performance issues from occurring in the future. We will continue to monitor network traffic until we confirm the issue is resolved.
Posted Oct 06, 2017 - 18:38 UTC
Update
The NYC2 maintenance is causing temporary, unexpected disruption to other services on the platform. Users may experience brief connectivity issues with the Cloud Panel and API, and delays in event processing.

We are still working on this issue and will continue to provide updates as we work through the problem.
Posted Oct 06, 2017 - 14:27 UTC
Update
The NYC2 maintenance is causing temporary, unexpected disruption to other services on the platform. Users may experience brief connectivity issues with the Cloud Panel and API, and delays in event processing.

We will continue to provide updates as we work through the problem.
Posted Oct 06, 2017 - 05:20 UTC
Update
The NYC2 maintenance is causing temporary, unexpected disruption to other services on the platform. Users may experience brief connectivity issues with the Cloud Panel and API.

Events are currently disabled.

We will continue to provide updates as we work through the problem.
Posted Oct 06, 2017 - 04:58 UTC
Update
The NYC2 maintenance is causing temporary, unexpected disruption to other services on the platform. Users may experience brief connectivity issues with the Cloud Panel and API, as well as delayed events.

We will continue to provide updates as we work through the problem.
Posted Oct 06, 2017 - 04:00 UTC
Update
As part of our ongoing efforts to mitigate the issues in NYC2, our engineering team is planning emergency maintenance from 2:00 to 10:00 UTC on Friday, October 6th. During this window, customers may experience a brief two-minute spike in latency and packet loss as traffic is re-routed. Once complete, we'll update the status page with a notice.

We recognize and apologize for the miscommunication so far on our end regarding this problem. Initial indications after our mitigation yesterday were good, but that turned out not to be the case after further investigation.

We always aim for transparency on the platform and will work to better report on issues like this in the future.
Posted Oct 05, 2017 - 22:53 UTC
Update
The engineering team is still narrowing down the scope of the issue. Our previous mitigation steps are helping, but some customers are continuing to see connectivity issues to our services. We will continue to update the community as we further resolve the problem.
Posted Oct 05, 2017 - 20:55 UTC
Identified
Our networking team believes it has identified the issue and has put mitigations in place to minimize impact while the longer-term resolution is being worked on. Customers may continue to see decreased networking and API performance until the permanent fix is in place. We will send out additional communication regarding the next steps that will be taken to address this issue.
Posted Oct 05, 2017 - 14:22 UTC
This incident affected: Regions (Global, NYC2) and Services (Event Processing, Networking).