Cloud control panel connectivity
Incident Report for DigitalOcean
Postmortem

The Incident

On 2017-12-12 at 04:25 UTC, a third-party service that DigitalOcean uses to verify user accounts experienced a major outage. DigitalOcean’s user verification system relies on this service to provide a user “fingerprint”, which indicates whether the user has previously accessed their account from the current device. Our error handling did not account for a complete outage at this vendor, so logins could not complete.

Once the issue was identified, we deployed a fix that enabled a fallback to secondary verification systems. Access to the Cloud Control Panel was restored and we believed the fix was successful.

Roughly twenty hours later (2017-12-13 00:33 UTC), the same third-party service experienced another major outage, and we found that our fix from the previous night had not fully resolved the issue. In response, we disabled the third-party service integration entirely and started using our fallback verification system as the primary means of validating users. We have since re-enabled the third-party integration and validated that we fall back automatically when the third-party service is unavailable.
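
As an illustration only, the automatic fallback described above can be sketched as follows. This is a minimal sketch, not DigitalOcean's actual implementation: the names verify_login, VendorUnavailable, fetch_fingerprint, and secondary_verify are hypothetical, and the sketch assumes that a complete vendor outage surfaces as an exception or a timeout on the fingerprint call.

```python
from typing import Callable, Tuple


class VendorUnavailable(Exception):
    """Hypothetical error raised when the fingerprint vendor cannot be reached."""


def verify_login(
    user_id: str,
    device_id: str,
    fetch_fingerprint: Callable[[str, str], bool],
    secondary_verify: Callable[[str, str], bool],
) -> Tuple[bool, str]:
    """Return (verified, method_used).

    The vendor fingerprint check is tried first. If the vendor is
    unreachable, the secondary check decides instead, so a vendor
    outage degrades verification rather than blocking the login.
    """
    try:
        return fetch_fingerprint(user_id, device_id), "fingerprint"
    except (VendorUnavailable, TimeoutError, ConnectionError):
        # Complete vendor outage: fall back instead of failing the login.
        return secondary_verify(user_id, device_id), "secondary"


# Simulated outage, used to exercise the fallback path (assumed test doubles).
def vendor_down(user_id: str, device_id: str) -> bool:
    raise VendorUnavailable("simulated complete outage")


def known_device_check(user_id: str, device_id: str) -> bool:
    # Stand-in for a secondary verification system.
    return device_id in {"laptop-1"}


result = verify_login("user-1", "laptop-1", vendor_down, known_device_check)
print(result)
```

Passing the two checks in as parameters makes it straightforward to simulate a complete vendor outage in a test, which is the failure mode the original error handling missed.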

Timeline of Events

December 12:

04:25 UTC - Third-party vendor experiences a major outage

04:39 UTC - Vendor updates status page to investigating

05:01 UTC - DigitalOcean support begins receiving tickets related to login failures

05:09 UTC - DigitalOcean engineer is paged to investigate issue

05:21 UTC - DigitalOcean engineer determines the login failures are the result of an ongoing outage at our vendor; begins work on fix

05:40 UTC - Fix is deployed to production

05:51 UTC - Reports of logins succeeding

December 13:

00:33 UTC - Vendor experiences another major outage

03:26 UTC - Login errors are reported consistent with the previous night’s issues

03:51 UTC - DigitalOcean engineer is paged to investigate

04:00 UTC - DigitalOcean engineer confirms the issue is the same as the previous night; begins work on completely disabling the vendor integration (switching to secondary verification) to restore logins

05:07 UTC - Fix to remove third-party usage is deployed to production

Future Measures

As a result of this incident, we are conducting an audit of our integrations with our vendors to ensure that we degrade gracefully when they experience partial or complete outages.

In Conclusion

We understand that not being able to log in is seriously detrimental to the usage of the DigitalOcean Control Panel, and for that we sincerely apologize. As we continue to improve our own systems and mitigate our reliance on others, we thank you for your understanding and patience.

Posted Dec 16, 2017 - 02:45 UTC

Resolved
Our engineering team has resolved the connectivity issues to our Cloud Control Panel. If you are still experiencing issues logging in, please open a ticket with our Support team.
Posted Dec 14, 2017 - 06:31 UTC
Monitoring
Our engineering team has identified and fixed the connectivity issues to our Cloud Control Panel. We're continuing to monitor the situation. If you are still experiencing issues logging in, please open a ticket with our Support team.
Posted Dec 14, 2017 - 06:03 UTC
Investigating
Our engineering team is actively investigating connectivity issues to our Cloud Control Panel. At this time, users may experience issues with logging in. We will keep you advised of updates, and apologize for any inconvenience.
Posted Dec 14, 2017 - 05:27 UTC
This incident affected: Regions (Global) and Services (Cloud Control Panel).