On July 24, 2024, DigitalOcean experienced downtime from near-simultaneous crashes affecting multiple hypervisors (ref: https://docs.digitalocean.com/glossary/hypervisor/) in several regions. In total, fourteen hypervisors crashed, the majority of which were in the FRA1 and AMS3 regions, the remaining being in LON1, SGP1, and NYC1. A routine kernel fix to improve platform stability was being deployed to a subset of hypervisors across the fleet, and that kernel fix had an unexpected conflict with a separate automated maintenance routine, causing those hypervisors to experience kernel panics and become unresponsive. This led to an interruption in service for customer Droplets, and other Droplet-based services until the affected hypervisors were rebooted and restored to a functional state.
July 24 22:55 - Rollout of the kernel fix begins.
July 24 23:10 - First hypervisor crash occurs and the Operations team begins investigating.
July 24 23:55 - Rollout of the kernel fix ends.
July 25 00:14 - Internal incident response begins, following further crash alerts firing.
July 25 00:35 - Diagnostic tests are run on impacted hypervisors to gather information.
July 25 00:47 - Kernel panic messages are observed on impacted hypervisors. Additional Engineering teams are paged for investigation.
July 25 01:42 - Operations team begins coordinated effort to reboot all impacted hypervisors to restore customer services.
July 25 01:50 - Root cause for the crashes is determined to be the conflict between the kernel fix and maintenance operation.
July 25 03:22 - Reboots of all impacted hypervisors complete, all services are restored to normal operation.