Cloudflare Reveals Cause of Global Outage

An oversized feature file, not a cyberattack, triggered a software error in Cloudflare’s systems and took a number of online services offline for hours.

On Tuesday, November 18, several online services, including OpenAI, X, and Ikea, were unavailable for more than three hours before returning to normal. The disruption was caused not by a cyberattack but by an error in the internal configuration of Cloudflare’s Bot Management system. Cloudflare’s CEO explains exactly what happened in a blog post.

Enlarged Feature File

“The problem was not caused directly or indirectly by a cyberattack or malicious activities of any kind,” emphasizes Matthew Prince, CEO of Cloudflare, in the post.

According to Cloudflare, the outage was caused by a change to the access permissions of a database system. The change inadvertently caused the system to write multiple entries to a so-called feature file, which Cloudflare’s Bot Management system relies on. As a result, the file doubled in size.

Software Error

The file was then automatically distributed to all machines in Cloudflare’s network. The network software that routes traffic relies on the file but enforces a maximum file size. When the file exceeded that limit, the software failed on multiple systems.
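To illustrate the failure mode described above, here is a minimal Python sketch of a loader that enforces a limit on a feature file. It is not Cloudflare’s code: the names `MAX_FEATURES`, `parse_feature_file`, and `FeatureStore` are hypothetical, the limit value is made up, and the limit is expressed as an entry count rather than a byte size for simplicity. The sketch shows one possible design in which an oversized file is rejected and the last known-good version stays active, rather than the process failing.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical limit; the article does not state the real value.
MAX_FEATURES = 200


class FeatureFileTooLarge(Exception):
    """Raised when a feature file has more entries than the limit allows."""


def parse_feature_file(text: str, max_features: int = MAX_FEATURES) -> list[str]:
    """Parse one feature name per line and reject oversized files."""
    features = [line.strip() for line in text.splitlines() if line.strip()]
    if len(features) > max_features:
        raise FeatureFileTooLarge(
            f"{len(features)} entries exceeds the limit of {max_features}"
        )
    return features


@dataclass
class FeatureStore:
    """Holds the active feature list and keeps it if an update is rejected."""

    active: list[str] = field(default_factory=list)

    def update(self, text: str, max_features: int = MAX_FEATURES) -> bool:
        """Try to load a new file; return False and keep the old one on failure."""
        try:
            self.active = parse_feature_file(text, max_features)
            return True
        except FeatureFileTooLarge:
            # Keep serving with the last known-good file instead of crashing.
            return False


store = FeatureStore()
good_file = "\n".join(f"feature_{i}" for i in range(3))
doubled_file = good_file + "\n" + good_file  # same entries written twice

accepted = store.update(good_file, max_features=4)
rejected = store.update(doubled_file, max_features=4)
```

In this sketch the doubled file has six entries against a limit of four, so the second update is refused and `store.active` still holds the three entries from the first file.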

Suspected DDoS

Initially, the team suspected a large-scale DDoS attack, but further investigation identified the real cause. Cloudflare then stopped the distribution process and rolled out a previous, working version of the file. Around 2:30 PM UTC, the network began to recover. By 5:06 PM UTC, all systems were operational again.
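The recovery steps described above, stopping distribution and restoring an earlier version, can be sketched as a toy model in Python. The class `FileDistributor` and its methods are hypothetical and only illustrate the general idea of pausing publication and rolling back; they do not represent Cloudflare’s actual tooling.

```python
from __future__ import annotations

from typing import Optional


class FileDistributor:
    """Toy distributor that keeps a history of published file versions."""

    def __init__(self) -> None:
        self.history: list[str] = []
        self.paused = False

    def publish(self, content: str) -> bool:
        """Publish a new version unless distribution is paused."""
        if self.paused:
            return False
        self.history.append(content)
        return True

    def current(self) -> Optional[str]:
        """Return the version currently being served, if any."""
        return self.history[-1] if self.history else None

    def pause(self) -> None:
        """Stop accepting new versions."""
        self.paused = True

    def rollback(self) -> str:
        """Discard the newest version and serve the previous one again."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.history[-1]


distributor = FileDistributor()
distributor.publish("v1: working file")
distributor.publish("v2: oversized file")

distributor.pause()                                    # stop further distribution
blocked = distributor.publish("v3: another bad file")  # refused while paused
restored = distributor.rollback()                      # back to the working version
```

After these steps, `distributor.current()` returns the first, working version, and no new file can be published until distribution is resumed.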

“We apologize for the impact on our customers and on the internet in general. Given Cloudflare’s importance in the internet ecosystem, any outage of one of our systems is unacceptable,” Prince stated in the blog post. He also provides an in-depth account of exactly what happened and which systems and processes failed.