At approximately 06:10 UTC on October 3, 2017, we began to see an unexpected increase in load on the core networking gear in NYC2. Network engineers investigated and determined that increased control plane load on core network equipment was likely causing the intermittent connectivity issues throughout NYC2.
Our engineering team began troubleshooting to identify the cause of the increased load, where it originated, and how to mitigate it. By Wednesday, all standard troubleshooting techniques and mitigations had been exhausted, and we engaged the hardware vendor for further assistance. On Wednesday evening, some minor configuration changes recommended by the vendor were implemented, but these were unsuccessful, and diagnosis of the root cause continued.
On Wednesday evening into Thursday, the networking devices were determined to be exhibiting behavior inconsistent with how the hardware platform should operate under normal conditions, and we decided to perform a software upgrade to a newer release. While the upgrade was not expected to mitigate the root problems itself, it was a required step to bring the networking devices to a baseline state where their behavior was consistent with how the hardware should operate. It also provided visibility into metrics not available in the older software release.
After the software upgrade was completed, efforts to isolate the root problems continued and were ultimately successful. A specific class of network traffic that occurs under normal operating conditions had increased by a significant factor, producing the increased control plane load that triggered the problems. Once this traffic had been identified, changes were made to isolate it, and mitigation of the root problems was completed on Friday.
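The resolution hinged on spotting one class of traffic running at a large multiple of its normal rate. A minimal sketch of that kind of check, using hypothetical per-class packet-rate counters (the report does not name the actual traffic class, metrics, or tooling involved):

```python
# Sketch: flag control-plane traffic classes whose current rate greatly
# exceeds a historical baseline. Class names, baselines, and the 5x
# threshold are hypothetical; real monitoring would pull these values
# from device telemetry.

BASELINE_PPS = {   # assumed long-term average packets/sec per class
    "arp": 1_200,
    "bgp": 300,
    "icmp": 800,
}

def anomalous_classes(current_pps, baseline=BASELINE_PPS, factor=5.0):
    """Return classes running at >= `factor` times their baseline,
    mapped to the observed multiple of baseline."""
    flagged = {}
    for name, rate in current_pps.items():
        base = baseline.get(name)
        if base and rate >= factor * base:
            flagged[name] = rate / base
    return flagged

# Example: ARP at 10x baseline would be flagged for isolation.
print(anomalous_classes({"arp": 12_000, "bgp": 310, "icmp": 900}))
# → {'arp': 10.0}
```

In practice, a flagged class would then be rate-limited or filtered at the control plane, which is the shape of the configuration change described above.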
Note: timeline events are collated and provided in 6-hour intervals
Oct 3, 2017:
06:10 UTC: Initial problem triggers; increase in control plane utilization is logged in monitoring systems
12:00 UTC: Support teams have identified intermittent connectivity problems in NYC2; customers have begun reporting intermittent connectivity problems in NYC2; network engineering team has been engaged to investigate reports
18:00 UTC: Network engineering has identified the increased control plane utilization on some core networking devices; network engineering continues to troubleshoot based on symptom reports; root cause not yet identified
Oct 4, 2017:
00:00 UTC: Network engineering continues to troubleshoot based on symptom reports; root cause not yet identified
06:00 UTC: Network engineering continues to troubleshoot based on symptom reports; root cause not yet identified
12:00 UTC: Network engineering has identified inconsistent behavior of certain control plane functions on core networking devices when handling specific types of network traffic; the issue can now be easily reproduced
18:00 UTC: With all normal avenues of troubleshooting exhausted, the network engineering team engages the hardware vendor with a priority 1 technical case
Oct 5, 2017:
00:00 UTC: Hardware vendor recommends configuration changes to try to mitigate the issue; network engineering implements the changes but the issue persists; troubleshooting with the hardware vendor continues
06:00 UTC: Troubleshooting with the hardware vendor continues
12:00 UTC: Hardware vendor identifies behavior inconsistent with how the hardware platform should perform; troubleshooting with the hardware vendor now focuses on this inconsistent hardware behavior as a key contributor to the issue
18:00 UTC: Hardware vendor and network engineering agree to perform an immediate software release upgrade on core networking devices to address the inconsistent platform behavior; network engineering begins preparations for the upgrade
Oct 6, 2017:
00:00 UTC: Preparations for the software release upgrade continue in parallel with ongoing troubleshooting with the hardware vendor
06:00 UTC: Software release upgrade is performed
12:00 UTC: Troubleshooting continues; certain network traffic begins to be identified as the likely root cause
16:45 UTC: Configuration changes are made to isolate the network traffic contributing to the increased control plane load, resolving the issue
There are a couple of key takeaways that would have helped reduce the impact and duration of this issue:
We are very disappointed this incident lasted as long as it did and are sincerely sorry for the inconveniences and frustrations it caused our users.