Snowflake’s own platform down worldwide for hours after careless software update

An incompatible database structure update caused a global outage at Snowflake on December 16. Users in multiple regions were unable to run queries for hours and experienced problems loading data.

On Tuesday, December 16, Snowflake customers in at least ten cloud regions experienced a platform outage. The problems lasted from 2:55 to 15:59 UTC, which amounts to just over thirteen hours. During that period, queries failed or were delayed. In addition, Snowpipe and Snowpipe Streaming, two services for automatically loading data, did not work as expected.

The outage affected data centers in the US (Virginia and Oregon), Europe (Ireland, London, Zurich, Sweden), Asia (Singapore, Mumbai) and Mexico, among others. Users reported error messages such as ‘SQL execution internal error’. In addition, data clustering appeared as ‘unhealthy’ in some cases, which could indicate performance issues.

Update error

The cause of the incident was a flaw in a new software version that Snowflake had rolled out earlier. That update contained a change to the database structure that proved incompatible with previous versions. This caused errors when certain data fields were accessed, leading to version conflicts and failed operations.
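To illustrate the kind of incompatibility described above, here is a minimal Python sketch. It does not reflect Snowflake's actual metadata or code, which has not been published in this detail: the schemas, the field names (`cluster_state`, `clustering_status`) and the helper functions are all hypothetical. It only shows how a renamed field breaks code from a previous release, and how a simple pre-rollout check could flag it.

```python
# Hypothetical illustration only: the schemas, field names and functions
# below are invented and are not taken from Snowflake.

def breaking_changes(old_schema: dict[str, str], new_schema: dict[str, str]) -> list[str]:
    """List the changes that would break a reader built against old_schema."""
    problems = []
    for field, old_type in old_schema.items():
        if field not in new_schema:
            problems.append(f"removed field: {field}")
        elif new_schema[field] != old_type:
            problems.append(f"type changed: {field} ({old_type} -> {new_schema[field]})")
    # Fields that exist only in new_schema are ignored: adding a field
    # does not break a reader that never looks at it.
    return problems


# Schema as understood by the previous release (assumed).
OLD_SCHEMA = {"table_id": "int", "cluster_state": "str"}
# Schema written by the new release: one field renamed, one added (assumed).
NEW_SCHEMA = {"table_id": "int", "clustering_status": "str", "note": "str"}


def old_reader(record: dict) -> str:
    # Code from the previous release still expects the old field name.
    return record["cluster_state"]


if __name__ == "__main__":
    # A check like this could run before rollout.
    print(breaking_changes(OLD_SCHEMA, NEW_SCHEMA))

    # Without such a check, the old reader fails at runtime.
    try:
        old_reader({"table_id": 1, "clustering_status": "healthy", "note": ""})
    except KeyError as exc:
        print(f"old reader failed on missing field {exc}")
```

In this sketch, adding a field is harmless, while removing or renaming one is not. That distinction is why changes of this kind are typically checked against the versions still in production before they are rolled out.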

No workaround was available for the affected users. Only customers who used replication to unaffected regions were able to keep working to some extent. Snowflake indicated that the situation normalized after the change was rolled back. Some customers may still have experienced delays in data processing due to a backlog of submitted requests, but everything should be working normally again by now.

Impact

Snowflake positions itself as a central platform for all of an enterprise’s data, as well as for the (AI) applications built on it. An outage, especially one that lasts for hours, therefore has a major impact on companies’ production environments.

In this case, the outage appears to stem from insufficiently robust test procedures. An update that broke compatibility was nevertheless rolled out on a large scale, and it took Snowflake quite a while to identify and resolve the problem. Something like that shouldn’t really happen, but it often does in practice. For Microsoft, bugs in updates are almost a monthly occurrence.