
Building a Low-Cost Lakehouse for Near Real-Time Analytics with Apache Iceberg and Nessie Catalog

Diogo Santos
9 min read · Oct 14, 2024


In the previous post, we focused on Apache Hudi and Apache Flink using Apache Hive as the catalog. The key difference now is the shift to using Iceberg as the table format and replacing Hive with Project Nessie as the catalog.

In this post, we will build a Low-Cost Lakehouse for near real-time analytics using Apache Iceberg 🧊 as the table format and Project Nessie 🦕 as the catalog.

💡 By combining Apache Iceberg's powerful features, such as full schema evolution, time travel and rollback, hidden partitioning, and performance optimization, with Project Nessie, we create a setup that efficiently handles both streaming and batch data ⚡.

🔑 This modern architecture is perfect for real-time insights at a lower cost. The cost analysis is similar to the one provided in this post.

What is a Table Format?

A table format acts as a metadata layer on top of the data files, defining how the data is organized in storage. Its goal is to abstract away the complexity of the physical data structure and to facilitate operations such as Data Manipulation Language (DML) statements and schema changes.

Modern table formats, such as Apache Iceberg, Apache Hudi, and Delta Lake, offer guarantees similar to those of relational databases, including atomicity and consistency, when executing DML operations.
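To make the idea of a metadata layer more concrete, here is a minimal sketch in plain Python. It is a toy model, not how Iceberg, Hudi, or Delta Lake are actually implemented: the names `Snapshot`, `ToyTable`, `commit`, and `rollback` are hypothetical and exist only for illustration. The assumption it illustrates is that a table is a list of immutable snapshots plus a single "current" pointer, so a write becomes visible all at once when that pointer moves.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: a schema plus the data files it covers."""
    snapshot_id: int
    schema: Tuple[str, ...]
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table format's metadata layer."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_id: Optional[int] = None

    def current(self) -> Optional[Snapshot]:
        """Return the snapshot readers see right now (None for an empty table)."""
        if self.current_id is None:
            return None
        return self.snapshots[self.current_id]

    def commit(self, schema: Sequence[str], added_files: Sequence[str],
               expected_id: Optional[int]) -> Snapshot:
        """Append files (and possibly a new schema) as one all-or-nothing step."""
        # Optimistic check: refuse the write if someone else committed first.
        if expected_id != self.current_id:
            raise RuntimeError("table changed since it was read; retry the commit")
        base = self.current()
        files = (base.data_files if base else ()) + tuple(added_files)
        snap = Snapshot(len(self.snapshots), tuple(schema), files)
        self.snapshots.append(snap)
        # Moving this single pointer is what makes the new data visible.
        self.current_id = snap.snapshot_id
        return snap

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot without touching data files."""
        if not 0 <= snapshot_id < len(self.snapshots):
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


table = ToyTable()
s0 = table.commit(("id", "amount"), ["f1.parquet"], expected_id=None)
# Schema change and new data land together in one commit.
s1 = table.commit(("id", "amount", "currency"), ["f2.parquet"],
                  expected_id=s0.snapshot_id)
table.rollback(s0.snapshot_id)
```

After the rollback, readers see only the first snapshot again, while the second snapshot and its file list remain in the history. Real table formats persist this metadata in files next to the data and rely on a catalog to swap the pointer atomically, which this sketch leaves out.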

The Lakehouse Architecture, introduced in this paper from Databricks, brought a significant transformation in data management. Some key areas:

  • Data Schema: The data stored follows a schema, which is defined by metadata describing the tabular structure of the dataset.
  • Query Optimization: The stored data is optimized to support efficient queries, particularly for use cases that rely on SQL-like queries.

What is a Metadata Catalog?

When working with several data sources, it becomes essential to understand which data is relevant to what we want to do and how we can locate it.

A catalog addresses these challenges by using metadata to help locate the information/datasets and…
