On January 23, 2024, at 12:30 UTC, an internal infrastructure cluster that powered a subset of DOKS clusters in the FRA1 region suffered an outage. An incorrect configuration was inadvertently applied to the infrastructure cluster during a maintenance operation, which triggered a cascading event that led to the automatic deletion of the involved DOKS clusters. As a result, the control planes became inaccessible and control plane data was lost.
Jan 23, 2024
12:21 - The incorrect configuration change is applied to the FRA1 internal cluster and triggers the automatic control plane deletion.
12:30 - The automatic deletion process finishes removing all the control planes of the DOKS clusters from the specific infrastructure cluster.
12:36 - The incorrect configuration is identified as the root cause, and preparations for recovery begin.
12:48 - Internal incident response kicks off to identify impacted clusters and track the recovery process.
13:00 - Unexpected problems during the recovery process are encountered, delaying the full recovery and requiring additional patching of the recovery mechanism.
15:00 - The recovery problems are fully addressed.
20:31 - All impacted HA DOKS clusters are recovered.
21:21 - All impacted DOKS clusters are recovered.
As of this week, we have rolled out a fix that prevents the automatic deletion of infrastructure cluster resources in the production environment under any circumstances.
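
For illustration only, a safeguard of this kind can be sketched as a check that runs before any deletion is carried out. This is a minimal sketch and not the actual fix: the names `DeletionRequest`, `guard_deletion`, and `DeletionBlockedError`, and the `environment` and `automated` fields, are all hypothetical and assume that each deletion request records where it runs and whether it was triggered automatically.

```python
from dataclasses import dataclass


class DeletionBlockedError(Exception):
    """Raised when the guard refuses an automated deletion."""


@dataclass(frozen=True)
class DeletionRequest:
    # Hypothetical fields; the real system's data model is not described here.
    resource_id: str
    environment: str  # e.g. "production" or "staging"
    automated: bool   # True when triggered by automation, not by an operator


def guard_deletion(request: DeletionRequest) -> None:
    """Refuse automated deletions in production; let other requests through."""
    if request.automated and request.environment == "production":
        raise DeletionBlockedError(
            f"automatic deletion of {request.resource_id} is blocked in production"
        )


if __name__ == "__main__":
    # An automated deletion in production is rejected before anything is removed.
    try:
        guard_deletion(DeletionRequest("control-plane-example", "production", True))
    except DeletionBlockedError as err:
        print(f"blocked: {err}")
```

The check fails closed: an automated request in production raises an error before anything is removed, instead of relying on the configuration being correct.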