On 2017-12-12 at 04:25 UTC, a third-party service that DigitalOcean uses to verify user accounts experienced a major outage. DigitalOcean’s user verification system relies on this service to provide a user “fingerprint”, which determines whether the user has previously accessed their account from the current device. Our error handling did not account for a complete outage at this vendor, so logins could not complete.
Once the issue was identified, we deployed a fix that enabled a fallback to secondary verification systems. Access to the Cloud Control Panel was restored and we believed the fix was successful.
Roughly twenty hours later (2017-12-13 00:33 UTC), the same third-party service experienced another major outage, and we found that our fix from the previous night had not fully resolved the issue. In response, we disabled the third-party integration entirely and made our fallback verification system the primary means of validating users. We have since re-enabled the third-party integration and validated that we fall back automatically when the third-party service is unavailable.
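The fallback behavior described above can be illustrated with a minimal sketch. It is not DigitalOcean's actual code: `verify_device`, `VerificationResult`, the `vendor_check` and `fallback_check` callables, and the `vendor_enabled` switch are all hypothetical names chosen for illustration. The sketch assumes the vendor client signals an outage by raising an exception, for example on a timeout or a refused connection.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signature: (user_id, device_id) -> True if the device is known.
DeviceCheck = Callable[[str, str], bool]


@dataclass
class VerificationResult:
    known_device: bool
    source: str  # "vendor" or "fallback"


def verify_device(
    user_id: str,
    device_id: str,
    vendor_check: DeviceCheck,
    fallback_check: DeviceCheck,
    vendor_enabled: bool = True,
) -> VerificationResult:
    """Ask the vendor first; degrade to the fallback check on any vendor failure."""
    if vendor_enabled:
        try:
            return VerificationResult(vendor_check(user_id, device_id), "vendor")
        except Exception:
            # Assumption: every vendor-side failure (timeout, connection error,
            # malformed response) is treated the same way, so that a complete
            # outage cannot block the login path.
            pass
    # Reached when the vendor is disabled by the switch or when its check failed.
    return VerificationResult(fallback_check(user_id, device_id), "fallback")


if __name__ == "__main__":
    def vendor_down(user_id: str, device_id: str) -> bool:
        # Simulates a complete vendor outage.
        raise ConnectionError("vendor unreachable")

    def secondary(user_id: str, device_id: str) -> bool:
        # Stand-in for a secondary verification system.
        return device_id == "laptop-1"

    print(verify_device("user-42", "laptop-1", vendor_down, secondary))
```

Setting `vendor_enabled=False` corresponds to disabling the integration entirely, while the `except` branch corresponds to automatic fallback during an outage. In a real system the caught exception would normally be logged and counted so that a vendor outage remains visible to operators.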
04:25 UTC (2017-12-12) - Third-party vendor experiences a major outage
04:39 UTC - Vendor updates status page to investigating
05:01 UTC - DigitalOcean support begins receiving tickets related to login failures
05:09 UTC - DigitalOcean engineer is paged to investigate issue
05:21 UTC - DigitalOcean engineer determines the login failures are the result of an ongoing outage at our vendor; begins work on a fix
05:40 UTC - Fix is deployed to production
05:51 UTC - Reports of logins succeeding
00:33 UTC (2017-12-13) - Vendor experiences another major outage
03:26 UTC - Login errors consistent with the previous night’s issues are reported
03:51 UTC - DigitalOcean engineer is paged to investigate
04:00 UTC - DigitalOcean engineer confirms the issue is the same as the previous night; begins work on completely disabling the vendor integration (switching to secondary verification) to restore logins
05:07 UTC - Fix to remove third-party usage is deployed to production
As a result of this incident, we are conducting an audit of our integrations with our vendors to ensure that we degrade gracefully when they experience partial or complete outages.