Global Internet Outage Triggered by Internal Error, Not Cyberattack, Says Cloudflare CEO

NEW DELHI, India — Cloudflare CEO Matthew Prince has confirmed that the massive global outage that disrupted major platforms worldwide was caused by an internal configuration error rather than a cyberattack.

The outage affected services including X, ChatGPT, Canva, Discord, and numerous other websites and apps across the globe, leading to widespread speculation about a possible security incident. Prince clarified in a detailed postmortem that the root cause was a flawed update to permissions on a ClickHouse database cluster.

According to Cloudflare, the update was intended to streamline data access. However, a faulty query caused the system to pull in far more data than expected. As a result, a key “feature file” generated from that data for the company’s Bot Management system grew beyond safe limits. This feature file is refreshed every five minutes and distributed across Cloudflare’s global network.

When the oversized file propagated through the network, it exceeded limits built into the software and caused routing systems at the network edge to crash. The situation was especially unstable because the corrupted file was generated only by the parts of the cluster that had already received the update.
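To make the failure mode concrete, here is a minimal Python sketch. It is not Cloudflare's code. For illustration, it assumes that the faulty query returned every feature row twice and that the edge software refuses to run with more than a fixed number of features. The names `build_feature_file` and `load_features`, the limit `MAX_FEATURES = 200`, and the feature counts are all hypothetical.

```python
MAX_FEATURES = 200  # hypothetical hard limit assumed by the edge software


def build_feature_file(rows):
    """Turn query result rows (name, type) into a feature file (a list of names).

    No deduplication is done, so any duplicate rows returned by the query
    end up in the file and make it larger.
    """
    return [name for name, _type in rows]


def load_features(feature_file):
    """Edge-side loader that treats an oversized file as a fatal error."""
    if len(feature_file) > MAX_FEATURES:
        # Nothing catches this, so the process handling traffic goes down.
        raise RuntimeError(
            f"feature file has {len(feature_file)} entries; limit is {MAX_FEATURES}"
        )
    return {name: index for index, name in enumerate(feature_file)}


# Before the permissions change: one copy of each (made-up) feature.
expected_rows = [(f"feature_{i}", "Float64") for i in range(120)]
# After it: the same rows come back twice (an assumption for this sketch).
faulty_rows = expected_rows * 2

good_file = build_feature_file(expected_rows)
bad_file = build_feature_file(faulty_rows)

print(len(load_features(good_file)))  # 120
try:
    load_features(bad_file)
except RuntimeError as err:
    print("edge crash:", err)  # edge crash: feature file has 240 entries; limit is 200
```

The point of the sketch is that nothing in the file is malformed. It is simply larger than the consumer was built to accept, and the consumer fails hard instead of rejecting it.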

As a result, Cloudflare’s network repeatedly cycled between partial recovery and fresh failures every five minutes, depending on whether it received the correct or corrupted file. This fluctuation continued for roughly three hours starting around 11:20 UTC, causing interruptions across the internet.
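The on-and-off pattern can be reproduced with a toy model. This sketch assumes that each five-minute refresh is produced by whichever cluster node happens to run the query, and that only nodes already carrying the permissions update produce the bad file. The node names, which nodes are updated, and the order in which they run are invented for illustration.

```python
from datetime import datetime, timedelta

REFRESH_INTERVAL = timedelta(minutes=5)

# Hypothetical cluster: True marks a node that already has the permissions
# update and therefore produces the oversized (bad) feature file.
NODE_UPDATED = {"ch-1": True, "ch-2": False, "ch-3": True, "ch-4": False}


def edge_status(generating_node):
    """State of the edge after it loads the file produced by this node."""
    return "FAILING" if NODE_UPDATED[generating_node] else "healthy"


def simulate(start, generators):
    """Return (time, node, status) for each five-minute refresh cycle."""
    return [
        (start + step * REFRESH_INTERVAL, node, edge_status(node))
        for step, node in enumerate(generators)
    ]


start = datetime.strptime("11:20", "%H:%M")  # only the time of day matters here
# Invented order in which nodes happen to run the refresh query.
generators = ["ch-1", "ch-2", "ch-3", "ch-4", "ch-1"]

for when, node, status in simulate(start, generators):
    print(when.strftime("%H:%M"), node, status)
# 11:20 ch-1 FAILING
# 11:25 ch-2 healthy
# 11:30 ch-3 FAILING
# 11:35 ch-4 healthy
# 11:40 ch-1 FAILING
```

In this model the edge flips between healthy and failing even though nothing on the edge itself changes. The outcome depends only on which node generated the latest file.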

Prince said the symptoms initially resembled a massive DDoS attack, leading engineers to misdiagnose the issue before uncovering the true cause. Cloudflare eventually halted the spread of the faulty file, restored an older working version, and restarted the affected systems. The company reported that full service was restored by 17:06 UTC.

Prince apologized for the disruption and called the incident Cloudflare’s most serious outage since 2019. He said the company will implement new safeguards, including stricter file-size limits, global kill switches for critical updates, and a comprehensive review of failure points in its core infrastructure. (Source: IANS)
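As a rough illustration of the kind of safeguards Prince described, the sketch below does three things. It checks each incoming feature file against size limits before applying it, keeps serving the last known-good file when a check fails, and honors a global kill switch that stops new files from being applied at all. The class `FeatureStore`, the limits `MAX_FEATURES` and `MAX_BYTES`, and the JSON file format are hypothetical choices for this example. This is not a description of Cloudflare's implementation.

```python
import json

MAX_FEATURES = 200      # hypothetical cap on the number of features
MAX_BYTES = 64 * 1024   # hypothetical cap on the raw file size


class FeatureStore:
    """Holds the active feature set and refuses unsafe updates instead of crashing."""

    def __init__(self, initial):
        self.active = list(initial)  # last known-good feature list
        self.kill_switch = False     # when True, no new files are applied
        self.rejected = 0            # number of files refused by validation

    def apply(self, raw: bytes) -> bool:
        """Try to apply a new feature file; return True only if it was accepted."""
        if self.kill_switch:
            return False  # updates are frozen; keep serving the current file
        if len(raw) > MAX_BYTES:
            self.rejected += 1
            return False
        try:
            features = json.loads(raw)
        except ValueError:  # covers malformed JSON and bad encodings
            self.rejected += 1
            return False
        if not isinstance(features, list) or len(features) > MAX_FEATURES:
            self.rejected += 1
            return False
        self.active = features  # only a validated file replaces the old one
        return True


store = FeatureStore(["feature_0", "feature_1"])
good = json.dumps([f"feature_{i}" for i in range(60)]).encode()
oversized = json.dumps([f"feature_{i}" for i in range(400)]).encode()

print(store.apply(good), len(store.active))       # True 60
print(store.apply(oversized), len(store.active))  # False 60
store.kill_switch = True
print(store.apply(good))                          # False
```

The design choice here is to treat an internally generated file with the same suspicion as outside input. A file that fails a check is dropped and counted, and traffic keeps flowing on the previous version instead of the process crashing.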