Cloud Control Panel Errors
Incident Report for DigitalOcean
Postmortem

The Incident

On August 1st, 2018, an outage caused errors for users attempting to access the Cloud Control Panel. DigitalOcean’s customer bills are generated on the first of the month, and a high number of billing detail requests from users was a major contributor to the outage. When the first outage was discovered, the Billing Engineering team was notified to investigate the failing service. In total, 76 minutes of downtime occurred between 15:12 UTC and 22:36 UTC, in intervals of 12-24 minutes.

One of the internal services that handles these requests could not accommodate the request load and failed when it reached its resource limits. The service then began to return TLS errors, which took down all access to the Cloud Control Panel. The internal service recovered automatically after each failure but then hit its resource limits again, failing and cutting off web access each time. The result was repeated intervals of downtime.
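The failure mode described above, where one overloaded dependency takes down every page that calls it, is commonly contained with a circuit breaker. The following is a minimal sketch of that general pattern, not a description of DigitalOcean's actual fix. The class, the function names, the section names, and the thresholds are all hypothetical and chosen only for illustration.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period (illustrative)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds (assumed value)
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency until the cool-down ends.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None         # cool-down over: allow a trial call
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def render_control_panel(fetch_billing, breaker):
    """Render all sections; only the billing section degrades on failure.

    The section names are hypothetical placeholders.
    """
    billing = breaker.call(fetch_billing, lambda: "Billing temporarily unavailable")
    return {"droplets": "ok", "networking": "ok", "billing": billing}
```

In this sketch, a failing billing call yields a placeholder for the billing section while the other sections still render. Once the breaker opens, the overloaded service is given time to recover instead of being called again immediately.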

Timeline of Events

19:12 UTC - Cloud Control Panel outage begins

19:15 UTC - Engineering teams are alerted to the outage and begin investigation

19:17 UTC - Cloud Control Panel recovers; Engineering teams continue to investigate the issue

20:10 UTC - Engineering teams determine the root cause of the outage

20:49 UTC - Engineering teams deploy a fix to isolate the problem to the Billing area of the Cloud Control Panel

22:08 UTC - Cloud Control Panel outage recurs, but this time the outage impacts only the Billing area

22:32 UTC - Outage to Billing area of Control Panel ends

02:30 UTC - Engineering teams deploy additional changes to increase resources and infrastructure allocation to prevent another outage while the full root cause is remediated over the next few days

02:30 UTC - Incident formally declared as resolved

Future Measures

The team is committed to making improvements that better enable our systems to proactively detect and address load increases and to limit the scope and visibility of a failure to only the affected sections of the Cloud Control Panel. To prevent similar issues in the future, we are increasing the memory and resiliency of Cloud Control Panel services so they can handle larger and more frequent requests and, where possible, optimizing and caching queries to lower load.
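As an illustration of how caching can lower load when many users request the same billing details in a short window, here is a minimal sketch of a time-to-live cache. It is an assumption about what such a cache could look like, not the team's implementation. `TTLCache`, `get_or_load`, and the 60-second lifetime are hypothetical.

```python
import time


class TTLCache:
    """Caches loaded values for a fixed lifetime (illustrative sketch)."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds  # assumed lifetime; tune to how fresh data must be
        self.clock = clock
        self._entries = {}      # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        # Serve from the cache while the entry is still fresh.
        if entry is not None and entry[0] > now:
            return entry[1]
        # Otherwise call the backing service once and remember the result.
        value = loader(key)
        self._entries[key] = (now + self.ttl, value)
        return value
```

With a cache like this in front of a billing query, repeated requests for the same account within the lifetime reach the backing service only once. The trade-off is that users may see data up to one lifetime old.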

In Conclusion

We know that service interruptions are frustrating for our users and we sincerely apologize for the inconvenience caused by this outage.

Posted Aug 07, 2018 - 17:50 UTC

Resolved
Our engineering team has resolved the issue with accessing the Billing area. If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Posted Aug 02, 2018 - 13:38 UTC
Monitoring
Our engineering team has confirmed that the fix implemented last night has mitigated errors received when users attempted to access the Billing area. We are monitoring the situation and we will post an update as soon as the issue is fully resolved.
Posted Aug 02, 2018 - 12:55 UTC
Update
Our engineering team has implemented a fix to stabilize the intermittent errors some users experienced when attempting to access the Billing area in the Cloud Control Panel. We are continuing to monitor the situation throughout the night, and will post an update in the morning (ET) once we can confirm full resolution.
Posted Aug 02, 2018 - 01:37 UTC
Update
Our engineering team is working to resolve the issue causing intermittent errors for some users attempting to access the Billing area in the Cloud Control Panel. We appreciate your patience and will share additional updates as they become available.
Posted Aug 02, 2018 - 00:57 UTC
Identified
While our engineering team has implemented a fix to resolve the issue causing errors on the Cloud Control Panel, we have identified a related issue causing errors for some users attempting to access or view the Billing area. We are working towards a solution and will share additional updates as they become available.
Posted Aug 01, 2018 - 22:54 UTC
Monitoring
Our engineering team has implemented a fix to resolve the issues affecting Cloud Control Panel, and is monitoring the situation. We will post an update as soon as the issue is fully resolved.
Posted Aug 01, 2018 - 21:58 UTC
Identified
Our engineering team has identified the issue causing errors within Cloud Control Panel, and is working towards a solution. We will share additional updates as they become available.
Posted Aug 01, 2018 - 21:29 UTC
Investigating
Our engineering team is investigating issues related to the Cloud Control Panel. During this time, users may receive error messages when attempting to access or perform actions within the Cloud Control Panel. We apologize for the inconvenience and will share an update once we have more information.
Posted Aug 01, 2018 - 20:43 UTC
This incident affected: Regions (Global) and Services (Cloud Control Panel).