AWS explains why it supports Iceberg

AWS is adopting the open Apache Iceberg table format across its analytics, machine learning and storage services. It is doing so at the request of customers who use S3 object storage.

AWS embraces the open Iceberg format, but why? AWS first mentioned Iceberg in 2023, in a preview for Redshift, the data warehouse that customers can also use to run analytical queries against external data lakes. And why is AWS backing only this format and not, for example, Delta Lake?

Why Iceberg?

Netflix began developing Iceberg in 2017 because Hive tables on S3 did not meet its needs. It donated the project to the Apache Software Foundation in 2018, and Iceberg has been an open format ever since. Iceberg adds an extra layer of metadata to datasets, which allows tables to be modified without having to rewrite the entire dataset.
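To illustrate the idea of a metadata layer, here is a minimal toy sketch in Python. It is not Iceberg's actual file layout or API: the names `ToyTable`, `Snapshot`, `append` and `scan` are hypothetical, and the "files" are plain in-memory lists. It only shows the general principle that a change writes a new data file and a new snapshot that points to it, while existing files stay untouched.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One table version: only the list of data files that belong to it."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table format's metadata layer."""
    # File name -> rows. A file is never changed after it has been written.
    files: Dict[str, List[dict]] = field(default_factory=dict)
    snapshots: List[Snapshot] = field(default_factory=list)

    def append(self, rows: List[dict]) -> Snapshot:
        # Write only the new rows to a new file; older files are left alone.
        name = f"data-{len(self.files):05d}.parquet"
        self.files[name] = list(rows)
        previous = self.snapshots[-1].data_files if self.snapshots else ()
        # The new snapshot reuses the old file list and adds the new file.
        snap = Snapshot(len(self.snapshots) + 1, previous + (name,))
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id: Optional[int] = None) -> List[dict]:
        # Read the table as of a given snapshot (default: the latest one).
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snap = self.snapshots[-1]
        else:
            snap = self.snapshots[snapshot_id - 1]
        return [row for name in snap.data_files for row in self.files[name]]


table = ToyTable()
first = table.append([{"id": 1}, {"id": 2}])
second = table.append([{"id": 3}])
```

In this sketch, `table.scan()` returns all three rows, while `table.scan(first.snapshot_id)` still returns the table as it looked before the second write, because the first snapshot only references the first file.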

According to Andy Warfield, an engineer at AWS, Iceberg helps customers combine databases with data stored in S3. The metadata in Iceberg makes it easy for users to adjust their datasets.

At re:Invent 2024, AWS introduced S3 Tables, a new Iceberg-based feature that speeds up data analysis through pre-partitioning and automatic updates and optimizations. It also works well with popular tools such as SageMaker and Redshift, The Register reports. The Iceberg approach is also used in SageMaker, the machine learning platform, where it facilitates some aspects of data warehousing, analytics and data alerts.

AWS chose Iceberg because of its broad support from technology companies such as Google and Snowflake. As a result, the cloud giant is passing over Delta Lake, a format developed by Databricks that is very popular with Microsoft. Delta Lake is also open source, but AWS feels that Iceberg is the better fit, both technically and practically, for current customer demand.