Control Panel Connectivity
Incident Report for DigitalOcean

The Incident

At 18:43 UTC on 2017-07-27, access to our control panel, public API, blog, community website, and API documentation became intermittent due to issues with our main internal traffic proxies. The cause was an unintended deployment to the active hosts, which introduced a concurrency issue and caused the system to fail. We'd like to apologize, share more details about exactly what happened, and explain how we are working to make sure it doesn't happen again.

A couple of weeks ago, our developers introduced some new capabilities in our tracing library. The library was deployed to production and hadn't exhibited any issues, so a team decided to expose the new capabilities in a gateway service and deploy that change to production as well.

When we deploy new changes to this gateway service, our developers first install the release on a passive set of hosts and test it there. Once the tests pass, the code is promoted to the active hosts.
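
To make that promotion model concrete, here is a minimal sketch of how a passive/active gate is meant to behave. It is written in Go purely for illustration; the release type, host names, and the deploy, smokeTest, and promote helpers are hypothetical and are not DigitalOcean's actual pipeline tooling.

```go
// Hypothetical sketch of a passive/active promotion gate; not DigitalOcean's
// actual deployment tooling.
package main

import "fmt"

type release struct {
	version string
}

// deploy and smokeTest stand in for the real pipeline steps.
func deploy(host string, r release) error {
	fmt.Printf("deploying %s to %s\n", r.version, host)
	return nil
}

func smokeTest(host string) error {
	fmt.Printf("running smoke tests on %s\n", host)
	return nil
}

// promote installs the release on the passive hosts first and only touches
// the active hosts once every passive host has passed its smoke tests.
// A failure at any passive step aborts the promotion, leaving the active
// hosts on the last stable release.
func promote(r release, passive, active []string) error {
	for _, h := range passive {
		if err := deploy(h, r); err != nil {
			return fmt.Errorf("passive deploy failed on %s: %w", h, err)
		}
		if err := smokeTest(h); err != nil {
			return fmt.Errorf("smoke tests failed on %s: %w", h, err)
		}
	}
	for _, h := range active {
		if err := deploy(h, r); err != nil {
			return fmt.Errorf("active deploy failed on %s: %w", h, err)
		}
	}
	return nil
}

func main() {
	err := promote(release{version: "v1.2.3"},
		[]string{"gateway-passive-1"}, []string{"gateway-active-1"})
	if err != nil {
		fmt.Println("promotion aborted:", err)
	}
}
```

The important property is that the active hosts are only touched after the passive hosts pass their tests; the failure described next is what happened when that gate was bypassed.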

In this case, one of our builds was exhibiting test failures on the passive hosts, and we began to debug the issue. An error in our deployment process then pushed the bad code to our active hosts as well. As a result, both our active and passive hosts were running the broken code, which meant we were unable to roll back to the last stable release or redeploy it. Without a quick rollback available, we had to undo the code changes in the service and redeploy.

The root cause was tracked down to a race condition in the tracing library that caused the system to crash. Because we were still seeing intermittent issues in production even after reverting the gateway changes, we temporarily disabled tracing, and our Cloud and API services were once again operating without error.
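
The report does not include the library's code or even its implementation language, but the failure mode is a familiar one. As a rough, hypothetical illustration in Go: the unsafeTracer below writes span tags to a shared map with no synchronization, so two concurrent writers terminate the whole process with a fatal runtime error, while the mutex-guarded safeTracer (or any equivalent synchronization) avoids the crash.

```go
// Illustrative only: a toy tracer with shared state, not DigitalOcean's
// actual tracing library.
package main

import "sync"

// unsafeTracer stores span tags in a plain map. If two goroutines call
// Tag at the same time, the Go runtime aborts with
// "fatal error: concurrent map writes" and the whole process crashes.
type unsafeTracer struct {
	tags map[string]string
}

func (t *unsafeTracer) Tag(k, v string) { t.tags[k] = v }

// safeTracer guards the same state with a mutex so concurrent requests
// can record tags without racing.
type safeTracer struct {
	mu   sync.Mutex
	tags map[string]string
}

func (t *safeTracer) Tag(k, v string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.tags[k] = v
}

func main() {
	t := &safeTracer{tags: make(map[string]string)}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			t.Tag("request", "handled") // swap in unsafeTracer to reproduce the crash
		}()
	}
	wg.Wait()
}
```

Data races of this kind are exactly what a race detector run in CI (for example, `go test -race` in Go codebases) is designed to surface before a deploy, which ties into the first of the future measures listed below.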

Timeline of Events

  • 18:14 UTC: One of our developers noticed that the latest deployment to the passive gateways in production was consistently failing smoke tests and started an investigation

  • 18:43 UTC: Accidental deployment of the bad code was performed against the active gateways in production

  • 18:48 UTC: Operations was notified that the control panel was returning 503 responses on all requests; an investigation was started

  • 19:29 UTC: Gateway changes were reverted and a new build was deployed to production; intermittent problems with Cloud and API were still reported

  • 19:42 UTC: One of our developers traced the crashes to the tracing library and continued to investigate

  • 20:20 UTC: The team deployed a new build to the gateways with tracing disabled and control panel access and APIs were stabilized

Future Measures

There are several concerns to address going forward:

  1. More concurrency/race-condition tests in the CI pipeline to catch this type of defect earlier.
  2. More controls in the pipeline tooling:
    • Preventing accidental deployments
    • Faster rollback of the gateway service in the event of a deployment failure
  3. More controls for enabling/disabling features without needing code deploys (a sketch of this approach follows below).
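
The report does not say which feature-flag mechanism will be adopted, so the following is only a hedged sketch of the third measure: tracing is gated behind a flag that can be flipped at runtime, here via a hypothetical internal-only HTTP endpoint, so that disabling a misbehaving feature does not require a code deploy. The flag name, endpoint path, and wiring are all illustrative.

```go
// Hypothetical runtime kill switch for tracing; the flag, endpoint, and
// wiring are illustrative, not DigitalOcean's implementation.
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

var tracingEnabled atomic.Bool // set at startup from configuration

// trace only emits a span when the flag is on, so flipping the flag off
// immediately bypasses the code path that was crashing.
func trace(event string) {
	if !tracingEnabled.Load() {
		return
	}
	fmt.Println("span:", event)
}

func main() {
	tracingEnabled.Store(true)

	// An internal-only endpoint (in practice behind auth, or driven by a
	// config watcher) flips the flag without a deploy, e.g.:
	//   curl -X POST 'localhost:8080/flags/tracing?enabled=false'
	http.HandleFunc("/flags/tracing", func(w http.ResponseWriter, r *http.Request) {
		tracingEnabled.Store(r.URL.Query().Get("enabled") == "true")
		fmt.Fprintf(w, "tracing enabled: %v\n", tracingEnabled.Load())
	})

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		trace("handled " + r.URL.Path)
		fmt.Fprintln(w, "ok")
	})

	_ = http.ListenAndServe(":8080", nil)
}
```

With a switch like this in place, the mitigation applied at 20:20 UTC (disabling tracing) could have been a configuration change rather than a new build and deploy.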

In Conclusion

We’re sorry for the impact this had on your work and business. We wanted to share the specific details around this incident as quickly and accurately as possible to give you insight into what happened and how we handled it. We thank you for your understanding, and if there is anything else we can do at this time, please feel free to reach out to us.

Posted Jul 29, 2017 - 02:28 UTC

Resolved
Our engineering team has fully resolved the gateway issue. If you experience any further problems, please reach out to our Support team.
Posted Jul 27, 2017 - 21:41 UTC
Monitoring
Our engineering team has successfully resolved the gateway issue. You should no longer experience any intermittent connectivity to our services. We’re continuing to monitor this issue in the meantime. If you experience issues, please open a ticket with our Support team right away.
Posted Jul 27, 2017 - 20:41 UTC
Update
Our engineering team is continuing to work through the gateway issue we've been experiencing. The intermittent connectivity to our Control Panel, API, and Community Site will continue until we've fully resolved the issue. Thank you all for your patience; we apologize for the inconvenience and will provide updates as often as possible.
Posted Jul 27, 2017 - 20:15 UTC
Identified
Our engineering team has identified the issue with the gateways that handle public traffic and is working to resolve it as quickly as possible. You will continue to experience partial connectivity to our Control Panel and API services, as well as our Community site. Thank you for your patience. We apologize for the inconvenience and will continue to provide updates.
Posted Jul 27, 2017 - 19:40 UTC
Monitoring
Our engineering team is actively monitoring a brief connectivity issue affecting our cloud platform. We will provide more updates shortly. Thank you for your patience, and we apologize for the inconvenience.
Posted Jul 27, 2017 - 18:57 UTC
This incident affected: Services (API, Cloud Control Panel, Community) and Regions (Global).