The Incident
On August 1st, 2018, an outage caused errors for users attempting to access the Cloud Control Panel. DigitalOcean’s customer bills are generated on the first of the month, and a high number of billing detail requests from users was a major contributor to the outage. Upon discovery of the first outage, the Billing Engineering team was notified to investigate the failing service. In total, 76 minutes of downtime occurred between 15:12 UTC and 22:36 UTC, in intervals of 12-24 minutes.
One of the internal services that handles these requests could not accommodate the request load and failed when it reached its resource limits. The service then began to return TLS errors, which took down all access to the Cloud Control Panel. The internal service recovered automatically after each failure but then hit its resource limits again, failing and limiting web access each time, which resulted in repeated intervals of downtime.
Timeline of Events
19:12 UTC - Cloud Control Panel outage begins
19:15 UTC - Engineering teams are alerted to the outage and begin investigation
19:17 UTC - Cloud Control Panel recovers; Engineering teams continue to investigate the issue
20:10 UTC - Engineering teams determine the root cause of the outage
20:49 UTC - Engineering teams deploy a fix to isolate the problem to the Billing area of the Cloud Control Panel
22:08 UTC - Cloud Control Panel outage recurs, but this time the outage impacts only the Billing area
22:32 UTC - Outage to Billing area of Control Panel ends
02:30 UTC (August 2nd) - Engineering teams deploy additional changes to increase resources and infrastructure allocation to prevent another outage while the full root cause is remediated over the next few days
02:30 UTC (August 2nd) - Incident formally declared as resolved
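The 20:49 UTC fix confined the failure to the Billing area, but the change itself is not described here. As a rough illustration only, one common way to get this kind of isolation is to load each section of a page independently and fall back to a placeholder when a single backend fails. Every name in the sketch below (render_section, render_panel, load_droplets, load_billing) is hypothetical and not taken from DigitalOcean's code.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SectionResult:
    name: str
    ok: bool
    body: str


def render_section(name: str, loader: Callable[[], str]) -> SectionResult:
    """Load one panel section; on failure, degrade only that section."""
    try:
        return SectionResult(name, True, loader())
    except Exception:
        # Assumption: any backend error (for example a TLS failure) is
        # contained here instead of failing the whole page.
        return SectionResult(name, False, f"{name} is temporarily unavailable")


def render_panel(loaders: Dict[str, Callable[[], str]]) -> Dict[str, SectionResult]:
    """Render every section independently of the others."""
    return {name: render_section(name, loader) for name, loader in loaders.items()}


def load_droplets() -> str:
    # Hypothetical healthy backend.
    return "3 droplets"


def load_billing() -> str:
    # Hypothetical failing backend, standing in for the overloaded service.
    raise ConnectionError("TLS handshake failed")


panel = render_panel({"Droplets": load_droplets, "Billing": load_billing})
for section in panel.values():
    print(section.name, "ok" if section.ok else "degraded", "-", section.body)
```

With this structure, an error in the billing backend degrades only the Billing section while the rest of the panel keeps rendering.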
Future Measures
The team is committed to making improvements that better enable our systems to proactively detect and address load increases, and to limit the scope and visibility of failures to only the affected sections of the Cloud Control Panel. To prevent similar issues in the future, we are increasing the memory and resiliency of Cloud Control Panel services to handle larger and more frequent requests and, where possible, optimizing queries and caching to lower load.
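As a minimal sketch of how caching can lower load from repeated billing detail requests, the example below keeps a result for a short time so that identical requests do not reach the backend again. It is an assumption-laden illustration, not a description of the actual remediation: TTLCache, fetch_billing_details, the account key, and the 60-second lifetime are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Cache values per key for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, compute: Callable[[], str]) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the backend call
        value = compute()
        self._entries[key] = (now, value)
        return value


backend_calls: List[str] = []  # records each simulated backend query


def fetch_billing_details() -> str:
    # Hypothetical expensive query against the billing service.
    backend_calls.append("query")
    return "invoice for account-123"


cache = TTLCache(ttl_seconds=60)
first = cache.get("account-123", fetch_billing_details)
second = cache.get("account-123", fetch_billing_details)
print(first, "| backend calls:", len(backend_calls))
```

The second request is served from the cache, so the backend is queried once. A real deployment would also need to bound the cache size and decide how stale a billing view may be.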
In Conclusion
We know that service interruptions are frustrating for our users and we sincerely apologize for the inconvenience caused by this outage.