DigitalOcean Services Status

Managed Kubernetes Service
Incident Report for DigitalOcean
Postmortem

Incident Summary

Beginning October 8th around 06:00 UTC, we experienced an issue in which a subset of Managed Kubernetes clusters had worker nodes go into NotReady status, disrupting the services and applications running on those nodes. Due to an unexpected auto-upgrade process in the Debian image used by DigitalOcean, customers experienced downtime on worker nodes starting between 06:00 and 07:00 UTC on October 8th, 9th, and 10th. This affected users of Managed Kubernetes on cluster versions 1.27.6-do.0 and 1.28.2-do.0 across multiple regions.

Incident Details

Root Cause

The root cause of this incident was a systemd timer in the DigitalOcean Debian image which automatically kicked off an apt upgrade at 06:00 UTC (with a randomized delay of up to 60 minutes). The upgrade brought in unexpected, unsupported versions of various components, including systemd and the kernel, on DOKS cluster versions 1.27.6-do.0 and 1.28.2-do.0. The kernel upgrade in particular could cause the upgrade process to take longer than 15 minutes, at which point systemd would abort the process ungracefully. This forced cancellation could leave worker nodes in a broken state.
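
The postmortem does not name the specific unit, but on a stock Debian image the daily package upgrade is typically driven by the apt-daily-upgrade timer/service pair shipped with apt. A minimal diagnostic sketch along these lines (assuming a systemd-based node; this is not DigitalOcean tooling) can list the timers present on a worker node and show the schedule and randomized delay each one carries:

    #!/usr/bin/env python3
    # Diagnostic sketch: list systemd timers on a node and print the unit file
    # of any apt-related timer, exposing its OnCalendar= schedule and
    # RandomizedDelaySec= window.
    import subprocess

    listing = subprocess.run(
        ["systemctl", "list-timers", "--all", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(listing)

    seen = set()
    for line in listing.splitlines():
        for field in line.split():
            if field.endswith(".timer") and "apt" in field and field not in seen:
                seen.add(field)
                # 'systemctl cat' prints the on-disk unit definition.
                subprocess.run(["systemctl", "cat", field], check=False)

A timer firing at 06:00 UTC with a randomized delay of up to 60 minutes matches the 06:00-07:00 UTC impact window observed on all three days.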

Impact

In the majority of cases, the canceled upgrade disrupted both public and private network connectivity for the affected worker nodes. Customers experienced their worker nodes going into NotReady status and becoming unavailable at the time of the upgrade, and the nodes remained unavailable until they were replaced or rebooted. In some cases, a single replacement or reboot was not sufficient for recovery and multiple attempts were required. This impact recurred twice more after the first event, around 06:00 UTC on October 9th and 10th.

Response

The initial triage by the Containers Engineering team quickly scoped the problem down to the worker nodes. Unfortunately, the complete network failures seen on most of the affected machines impaired efforts to log into the worker nodes for in-depth troubleshooting. For security reasons, Managed Kubernetes worker nodes provide only limited access to Containers Engineering, so the team had to arrange machine access via SSH and the console during windows when the impact was not occurring. This prolonged the time until in-depth investigations could commence.

Once nodes became accessible, it became apparent that processes running on a given machine failed to connect to remote destinations both over the VPC and the public network. Early investigations focused on trying to understand which section of the data path may have been disrupted and why. This involved running several traces across the worker nodes and the corresponding hypervisors. The results surfaced broken connectivity at the guest OS level, primarily manifesting as response packets being delivered to the machine successfully but not further relayed to the originating processes (including kubelet and workload pods). This ruled out the data path as the source of problems.
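
As an illustration of that kind of localization (a sketch only, not the internal tracing tooling; the interface name and probe target are placeholders), one can capture traffic on the node while a local process attempts an outbound connection. Replies that are visible on the wire but never complete the socket-level connection point at the guest OS rather than the external data path:

    #!/usr/bin/env python3
    # Illustrative probe: capture reply packets on the node's interface while
    # attempting a TCP connection from a regular process on the same node.
    import socket
    import subprocess

    IFACE = "eth0"             # placeholder for the node's primary interface
    TARGET = ("1.1.1.1", 443)  # placeholder for any reachable remote endpoint

    # Capture packets to/from the target in the background (requires root).
    capture = subprocess.Popen(
        ["tcpdump", "-ni", IFACE, "-c", "10",
         f"host {TARGET[0]} and port {TARGET[1]}"]
    )

    # Attempt the same connection from a local process.
    try:
        with socket.create_connection(TARGET, timeout=5):
            print("socket-level connect succeeded")
    except OSError as exc:
        print(f"socket-level connect failed: {exc}")

    # Stop the capture once the probe is done.
    capture.terminate()
    capture.wait()

On an affected node, the capture shows outgoing packets and their responses arriving at the interface while the connection attempt above still fails, matching the behavior described here.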

At this point, customers were actively advised to try to replace or reboot machines in order to mitigate the impact.

The investigation then pivoted to the periodic nature of the incident, which had become clearer with the subsequent occurrences, always starting around the same time of day (06:00 UTC). The fact that impact was limited to two specific Managed Kubernetes versions pointed towards a regularly executing process on the worker nodes themselves.

After ruling out cron jobs as the culprit, the team identified a systemd timer on the Debian-based worker nodes that upgrades all packages on a daily schedule of 06:00 UTC (plus a random delay of up to 60 minutes). The team then ran the timer manually on a test cluster and observed it kick off a lengthy upgrade involving numerous system-critical OS packages, including systemd and the kernel. Although the effect of the expiring 15-minute timeout harshly terminating the upgrade process could not be fully validated at this point, sufficient data points had been collected to consider this timer job the root cause of the incident.
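
One way to reproduce such a run outside production (a sketch that assumes the timer is Debian's standard apt-daily-upgrade pair; the unit name would need adjusting if the image uses a custom unit) is to start the corresponding service by hand on a disposable test node and follow its journal, where a run exceeding its timeout and the resulting forced termination become visible:

    #!/usr/bin/env python3
    # Reproduction sketch for a disposable test node only.
    import subprocess

    UNIT = "apt-daily-upgrade.service"  # assumed name of the service behind the timer

    # Start the upgrade service the same way the 06:00 UTC timer would.
    subprocess.run(["systemctl", "start", "--no-block", UNIT], check=True)

    # Follow the unit's journal to watch the upgrade run and any timeout-driven
    # termination by systemd. Stop with Ctrl-C.
    subprocess.run(["journalctl", "-u", UNIT, "-f"], check=False)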

Timeline of Events in UTC

October 8th: 

  • 07:43 - Sporadic customer reports arrive, indicating worker node failures that can be addressed by replacing the affected nodes

October 9th: 

  • 06:31 - Broader impact occurs with worker nodes affected across multiple regions
  • 07:10 - Internal incident response kicks off, investigation begins
  • 08:13 - Access is gained to logs of a previously affected worker node; machine-wide request failures are confirmed
  • 09:43 - Additional Networking teams provide support to pinpoint the guest OS as the likely root cause; Engineering access to affected nodes is still inhibited
  • 15:28 - Tooling work enabling better analysis on recurrence of the issue is completed

October 10th: 

  • 06:05 - Impact recurs, affecting the first customers
  • 06:18 - Engineering is able to access and live-debug on an affected worker node
  • 13:08 - The auto-upgrade timer is discovered
  • 14:28 - Behavior of the auto-upgrade timer is confirmed and tied to the incident
  • 14:36 - Work starts to build a DaemonSet-based mitigation fix disabling the auto-upgrade timer on each worker node

October 11th:

  • 03:08 - The fix is released to all affected clusters
  • 06:00-07:00 - No new disruptions or recurrences of the issue are observed

Remediation Actions

In order to avoid another occurrence of the incident on October 11th or later, a quick fix was put together and released. This fix consists of a DaemonSet workload that disables the relevant systemd timer on each worker node. This mitigated the problem on all current and future worker nodes.
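
The manifest itself is not included in this report, but the general shape of such a mitigation is a privileged DaemonSet whose pod enters the host's namespaces and disables the timer with systemctl. A minimal sketch of the per-node entrypoint (the timer names and the nsenter-based host access are assumptions, not DigitalOcean's actual implementation):

    #!/usr/bin/env python3
    # Sketch of an entrypoint a privileged DaemonSet pod could run on each node
    # to turn off the host's automatic apt upgrade timers (unit names assumed).
    import subprocess
    import time

    TIMERS = ["apt-daily-upgrade.timer", "apt-daily.timer"]

    def host_systemctl(*args):
        # Enter the namespaces of PID 1 so systemctl talks to the node's systemd
        # instance rather than the container's; requires hostPID and privileges.
        return subprocess.run(
            ["nsenter", "-t", "1", "-m", "-u", "-i", "-n", "systemctl", *args],
            check=False,
        )

    for timer in TIMERS:
        host_systemctl("disable", "--now", timer)  # stop it and drop enablement
        host_systemctl("mask", timer)              # prevent it from being restarted

    # Sleep forever so the DaemonSet pod stays Running on every node.
    while True:
        time.sleep(3600)

Because a DaemonSet schedules a copy of this pod onto every existing and newly added node, the same disablement applies to current and future worker nodes, as described above.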

The next step in remediation involves applying the same fix during our node provisioning process so that worker nodes no longer have the timer enabled in the first place. Once that is complete, the DaemonSet can and will be removed from all clusters.

Additional validation is also being added to our internal conformance test suite, which runs on all new cluster versions, to ensure that the auto-upgrade process and any other undesirable timers stay disabled going forward.
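
Such a check can be as simple as asserting on each node image that the relevant timers report a disabled or masked state (a sketch; the timer list and accepted states are assumptions rather than the actual test suite):

    #!/usr/bin/env python3
    # Conformance-style check sketch: fail if an automatic apt upgrade timer is
    # still enabled on the node image.
    import subprocess
    import sys

    BANNED_TIMERS = ["apt-daily-upgrade.timer", "apt-daily.timer"]
    ACCEPTED_STATES = {"masked", "disabled", ""}  # empty output: unit not present

    failures = []
    for timer in BANNED_TIMERS:
        result = subprocess.run(
            ["systemctl", "is-enabled", timer],
            capture_output=True, text=True,
        )
        state = result.stdout.strip()
        if state not in ACCEPTED_STATES:
            failures.append(f"{timer} is {state}")

    if failures:
        sys.exit("auto-upgrade timers still active: " + ", ".join(failures))
    print("auto-upgrade timers are disabled on this node image")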

Finally, work is planned to support logging into worker node machines even when conventional access paths are impaired.

Posted Oct 13, 2023 - 18:56 UTC

Resolved
Our Engineering team has resolved the issue with the Managed Kubernetes Service. A daemonset has been released to all existing clusters, eliminating the auto-update process, which will remain disabled going forward. If you find worker nodes that could still be affected by a prior occurrence of this incident, please replace them for permanent mitigation.

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Posted Oct 11, 2023 - 07:26 UTC
Monitoring
Our Engineering team has completed the work for both items. New images have been released that eliminate the auto-update process and the daemonset has been applied to all existing clusters.

Given this, the team is not expecting to see a recurrence of this incident around 06:00 UTC.

We will continue to monitor this issue to confirm the fix was successful and will post a final update after 06:00 UTC.
Posted Oct 11, 2023 - 04:06 UTC
Identified
Our Engineering team continues to investigate the root cause of this incident. At this time, we've observed that this incident seems to mostly impact Managed Kubernetes clusters running versions 1.27.6 and 1.28.2.

We've also observed the issue being triggered around 06:00 UTC on each of the last three days. Because of this, the team believes there is an auto-update process that triggers a bug when updates are installed.

Our Engineering team is currently working on releasing new images that will eliminate the auto-update process and is creating a daemonset to remediate worker nodes already affected by the issue.

We believe these two actions will mitigate the issue and stop it from being triggered again at 06:00 UTC. We will post further updates as work progresses on the items above.

Thank you for your patience.
Posted Oct 10, 2023 - 16:12 UTC
Update
Our Engineering team is working diligently to identify the underlying cause of the issue affecting Managed Kubernetes in all regions. We believe we are getting close to a fix that will solve the issue permanently.

Most Managed Kubernetes clusters have been seen to self-heal, or customers have rebooted or replaced nodes to restore cluster health. However, a subset of users may intermittently experience worker nodes entering an unexpected not-ready state, which impacts Managed Kubernetes cluster accessibility.

We apologize for the inconvenience and thank you for your patience. If you experience an issue with your Managed Kubernetes cluster, please open a ticket with our Support Team.
Posted Oct 10, 2023 - 12:08 UTC
Update
Our Engineering team is continuing to investigate the root cause of the issue impacting Managed Kubernetes across all regions.

Most Managed Kubernetes clusters have been seen to self-heal, or customers have rebooted or replaced nodes to restore cluster health. However, a subset of users may intermittently experience worker nodes entering an unexpected not-ready state, which impacts Managed Kubernetes cluster accessibility.

As soon as we have more details or can confirm that this incident has been fully resolved, we'll update this page. If you experience an issue with your Managed Kubernetes cluster, please open a ticket with our Support Team.
Posted Oct 10, 2023 - 06:54 UTC
Monitoring
Our Engineering team has continued to investigate this issue but, unfortunately, has not been able to determine a root cause for why worker nodes are unexpectedly entering the not-ready state. Attempts to replicate this directly have also been unsuccessful. We've observed the majority of Managed Kubernetes clusters self-healing or customers rebooting/replacing nodes to get clusters back into healthy states. We have also observed no new reports of this behavior over the last few hours.

Our Engineering team will continue to investigate this incident and monitor for a recurrence. We will post an update once we have further information or we confirm this incident is resolved.

If you experience an issue with your Managed Kubernetes cluster, please open a ticket with our Support Team.
Posted Oct 09, 2023 - 17:49 UTC
Update
As of 11:56 UTC, our Engineering team is continuing to investigate an issue impacting Managed Kubernetes across all regions. Users may intermittently experience worker nodes entering an unexpected not-ready state, which impacts Managed Kubernetes cluster accessibility.

Our Engineers are currently working on isolating the root cause of the issue. We do not have a precise resolution time yet; however, we will provide updates as developments occur.

We apologize for the inconvenience and thank you for your patience and continued support.
Posted Oct 09, 2023 - 11:58 UTC
Investigating
Our Engineering team is currently investigating an issue impacting Managed Kubernetes across all regions. During this time, some users may experience worker nodes entering an unexpected not-ready state, which impacts cluster accessibility.

We apologize for the inconvenience and will share more information as soon as it's available.
Posted Oct 09, 2023 - 07:26 UTC
This incident affected: Kubernetes (AMS3, BLR1, FRA1, LON1, NYC1, NYC3, SFO2, SFO3, SGP1, SYD1, TOR1).