DigitalOcean Services Status

Managed Kubernetes Cluster in FRA1
Incident Report for DigitalOcean
Postmortem

Incident Summary

On January 23, 2024 at 12:30 PM UTC, we suffered an outage with an internal infrastructure cluster that powered a subset of DOKS clusters in the fra1 region. Specifically, an incorrect configuration was inadvertently applied to the infrastructure cluster during a maintenance operation, which then triggered a cascading event leading to the automatic deletion of the involved DOKS clusters. As a consequence, the control planes became inaccessible, and control plane data loss occurred.

Incident Details

  • Root Cause: An incorrect configuration change to the production infrastructure during the maintenance effort.
  • Impact: The configuration change triggered an automatic deletion process on the underlying compute resources hosting the control planes of the DOKS clusters. As a result, the impacted DOKS clusters lost their control planes.
  • Response: All DOKS etcd clusters are backed up twice a day, on a schedule randomized per cluster. Recovery involved restoring each etcd cluster from its latest available backup. In the worst case, this left Kubernetes control plane data up to 12 hours out of date. Once etcd was recovered, the stateless control plane components were restored as well, completing the recovery of the control plane. Existing data plane workloads in the affected clusters continued to operate while recovery was in progress.
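
The 12-hour worst case follows from the backup cadence. The sketch below is illustrative only and is not DigitalOcean's backup tooling: it assumes the two daily backups are exactly 12 hours apart and that each cluster has a fixed, hypothetical schedule_offset from midnight UTC. The function name backup_age and the sample offsets are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "twice a day" means backups exactly 12 hours apart.
BACKUP_INTERVAL = timedelta(hours=12)


def backup_age(failure_time: datetime, schedule_offset: timedelta) -> timedelta:
    """Age of the newest backup at failure_time.

    schedule_offset is a hypothetical per-cluster shift of the backup
    schedule relative to midnight (backups run at offset and offset + 12h).
    """
    midnight = failure_time.replace(hour=0, minute=0, second=0, microsecond=0)
    since_midnight = failure_time - midnight
    # Modulo on timedeltas wraps to the most recent backup slot, including
    # the case where the last backup ran on the previous day.
    return (since_midnight - schedule_offset) % BACKUP_INTERVAL


if __name__ == "__main__":
    # Illustrative failure time taken from the timeline (12:21 UTC).
    failure = datetime(2024, 1, 23, 12, 21, tzinfo=timezone.utc)
    for minutes in (0, 21, 22):
        offset = timedelta(minutes=minutes)
        print(f"offset {offset}: newest backup is {backup_age(failure, offset)} old")
```

Under these assumptions, a cluster whose next backup was due one minute after the failure would be restored from data almost 12 hours old, while a cluster backed up just before the failure would lose almost nothing.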

Timeline of Events

Jan 23, 2024

12:21 - The incorrect configuration change is applied to the FRA1 internal cluster and triggers the automatic control plane deletion.

12:30 - The automatic deletion process removes all the control planes of the DOKS clusters hosted on the affected infrastructure cluster.

12:36 - The incorrect configuration is identified as the root cause, and preparations for recovery begin.

12:48 - Internal incident response kicks off to identify impacted clusters and track the recovery process.

13:00 - Unexpected problems during the recovery process are encountered, delaying the full recovery and requiring additional patching of the recovery mechanism.

15:00 - The recovery problems are fully addressed.

20:31 - All impacted HA DOKS clusters are recovered.

21:21 - All impacted DOKS clusters are recovered.

Remediation Actions

As of this week, a fix to prevent the automatic deletion of infrastructure cluster resources under any circumstances in the production environment has been rolled out.
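
The details of the fix are not described in this report. As a rough illustration of the general idea only, a guard that refuses automated deletions in protected environments could look like the sketch below. Every name in it (PROTECTED_ENVIRONMENTS, DeletionBlockedError, delete_resource, delete_fn) is hypothetical and does not reflect DigitalOcean's actual implementation.

```python
from typing import Callable

# Hypothetical set of environments where automated deletion is never allowed.
PROTECTED_ENVIRONMENTS = frozenset({"production"})


class DeletionBlockedError(RuntimeError):
    """Raised when an automated deletion targets a protected environment."""


def delete_resource(
    resource_id: str,
    environment: str,
    *,
    automated: bool,
    delete_fn: Callable[[str], None],
) -> None:
    """Delete a resource unless an automated caller targets a protected environment."""
    if automated and environment in PROTECTED_ENVIRONMENTS:
        # Fail loudly instead of deleting, so a bad configuration cannot
        # cascade into resource removal.
        raise DeletionBlockedError(
            f"automated deletion of {resource_id!r} blocked in {environment!r}"
        )
    delete_fn(resource_id)


if __name__ == "__main__":
    deleted = []
    delete_resource("node-a", "staging", automated=True, delete_fn=deleted.append)
    try:
        delete_resource("node-b", "production", automated=True, delete_fn=deleted.append)
    except DeletionBlockedError as err:
        print(f"blocked: {err}")
    print(f"deleted: {deleted}")
```

The design choice in this sketch is to raise an error and not skip silently, so that a misapplied configuration surfaces as a visible failure during the maintenance operation.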

Posted Feb 02, 2024 - 01:25 UTC

Resolved
Our Engineering team has completed mitigation efforts for the issue impacting Managed Kubernetes in the FRA1 region and we are marking this incident as Resolved.

At this time, functionality has been restored for impacted clusters, but customers may need to reconfigure some Kubernetes resources. Customer Support is contacting impacted customers directly with further instructions.

If you have any questions or concerns regarding this incident, please open a ticket with our support team.
Posted Jan 23, 2024 - 22:50 UTC
Update
Our Engineering team continues to work on mitigation efforts. An additional small bug has been discovered and remediated. About 10% of clusters have had accessibility restored and restoration efforts are ongoing.

We will post another update as soon as we have new developments.

Thank you for your patience and we apologize for any inconvenience.
Posted Jan 23, 2024 - 18:44 UTC
Identified
Our Engineering team has identified the cause of the issue with Managed Kubernetes clusters in the FRA1 region. 200 clusters are impacted by the issue and remain inaccessible to users at this time.

Our Engineering team is engaged in remediating these clusters to restore accessibility. As soon as we are able to provide an estimated time to restore, we will provide an update.
Posted Jan 23, 2024 - 15:12 UTC
Investigating
As of 12:18 UTC, our Engineering team is investigating an issue with Kubernetes clusters in the FRA1 region. During this time, users may experience errors while communicating with their clusters in the FRA1 region.

We apologize for the inconvenience and will share an update once we have more information.
Posted Jan 23, 2024 - 13:02 UTC
This incident affected: Kubernetes (FRA1).