At 18:43 UTC on 2017-07-27, access to our control panel, public API, blog, community website, and API documentation became intermittent due to a failure in our main internal traffic proxies. The root cause was an unintended deployment to the active proxy hosts, which introduced a concurrency bug that crashed the service. We'd like to apologize, share the details of exactly what happened, and explain how we are working to make sure it doesn't happen again.
A couple of weeks ago, our developers introduced some new capabilities in our tracing library. The library was deployed to production and hadn't exhibited any issues, so a team decided to expose the new capabilities through a gateway service and deploy that change to production as well.
When we deploy changes to this gateway service, our developers install the release on a passive set of hosts and test it there. Once the tests pass, the code is promoted to the active hosts.
In this case, one of our builds was failing tests on the passive hosts, and we began debugging. An error in our deployment process then pushed the bad code to our active hosts as well, leaving both host sets running broken code. Because of this, we were unable to roll back or deploy the last stable release; instead, we had to undo the code changes in the service and redeploy.
We tracked the root cause to a race condition in the tracing library that caused the system to crash. However, since we were still seeing intermittent issues in production, we temporarily disabled tracing, after which our Cloud and API services were operating without error again.
18:14 UTC: One of our developers noticed that the latest deployment to the passive gateways in production was consistently failing smoke tests and started an investigation
18:43 UTC: The bad code was accidentally deployed to the active gateways in production
18:48 UTC: Operations was notified that the control panel was returning 503 responses on all requests; an investigation was started
19:29 UTC: Gateway changes were reverted and a new build was deployed to production; intermittent problems with Cloud and API were still reported
19:42 UTC: One of our developers traced the crashes to the tracing library and continued to investigate
20:20 UTC: The team deployed a new build to the gateways with tracing disabled and control panel access and APIs were stabilized
There are several concerns to address going forward:

- Fixing the error in our deployment process so that a build failing tests on the passive hosts can never be promoted to the active hosts
- Fixing the race condition in the tracing library before tracing is re-enabled
- Keeping a known-good release available so that we can roll back quickly rather than undoing changes and redeploying
We're sorry for the impact this had on your work and business. We wanted to share the specific details of this incident as quickly and accurately as possible to give you insight into what happened and how we handled it. We thank you for your understanding, and if there is anything else we can do at this time, please don't hesitate to reach out to us.