Building a Low-Cost Lakehouse for Near Real-Time Analytics with Apache Iceberg and Nessie Catalog

Diogo Santos
9 min read · Oct 14, 2024

The previous post focused on Apache Hudi and Apache Flink, with Apache Hive as the catalog. The key difference now is the shift to Iceberg as the table format, with Project Nessie replacing Hive as the catalog.

In this post, we will build a Low-Cost Lakehouse for near real-time analytics using Apache Iceberg 🧊 as the table format and Project Nessie 🦕 as the catalog.

💡 By combining Apache Iceberg's powerful features, such as full schema evolution, time travel and rollback, hidden partitioning, and performance optimization, with Project Nessie, we create a setup that efficiently handles both streaming and batch data ⚡.

🔑 This modern architecture is perfect for real-time insights at a lower cost. The cost analysis is similar to the one provided in this post.

What is a Table Format?

A table format acts as a metadata layer on top of the data files, defining how the data is organized in storage. Its goal is to abstract away the complexity of the physical data layout and to support operations such as Data Manipulation Language (DML) statements and schema changes.
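
To make the idea of a metadata layer more concrete, here is a minimal toy sketch in Python. It is not how Iceberg actually stores its metadata. The `Snapshot` and `ToyTable` classes and the file paths are hypothetical names invented for illustration. The sketch only shows the general principle: data files are never modified, and each change (appending a file, adding a column) is recorded as a new snapshot in the metadata, which is what makes reading an older version of the table possible.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: a schema plus the files that hold its rows."""
    snapshot_id: int
    schema: Tuple[str, ...]
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical metadata layer; it only tracks files, it never rewrites them."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def _commit(self, schema, data_files) -> Snapshot:
        # Every change produces a new snapshot instead of mutating the old one.
        snap = Snapshot(len(self.snapshots) + 1, tuple(schema), tuple(data_files))
        self.snapshots.append(snap)
        return snap

    def create(self, schema) -> Snapshot:
        # The first snapshot has a schema but no data files yet.
        return self._commit(schema, ())

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def append_file(self, path: str) -> Snapshot:
        # A DML-style change: new rows arrive as a new file in the metadata.
        cur = self.current()
        return self._commit(cur.schema, cur.data_files + (path,))

    def add_column(self, name: str) -> Snapshot:
        # A schema change touches metadata only; existing files stay as they are.
        cur = self.current()
        return self._commit(cur.schema + (name,), cur.data_files)

    def as_of(self, snapshot_id: int) -> Snapshot:
        # Reading an older snapshot is a metadata lookup, not a data copy.
        return self.snapshots[snapshot_id - 1]


table = ToyTable()
table.create(["id", "amount"])
table.append_file("data/part-0001.parquet")   # illustrative path
table.add_column("currency")
table.append_file("data/part-0002.parquet")   # illustrative path

latest = table.current()       # three columns, two data files
earlier = table.as_of(2)       # two columns, one data file
```

A query engine reading `latest` and one reading `earlier` would see different schemas and file lists without any data file having been rewritten. Real table formats such as Iceberg apply the same principle with far richer metadata.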
