DaaS: Building a Low-Cost Lakehouse for Near Real-Time Analytics with Flink and Hudi
5 min read · Apr 8, 2024
This solution combines Apache Flink for near real-time data processing with Apache Hudi for managing the data storage layer. Key achievements from leveraging both technologies:
- ✅ Incremental Data Updates: Existing records are updated in place instead of accumulating redundant copies of the data.
- ✅ Efficient Upserts with Apache Hudi: Hudi's indexing capabilities allow for quick and efficient record updates.
- 🔥 Partial Updates: Lets us modify specific fields within a record without rewriting the entire record, simplifying the data pipeline.
- 🔥 Compaction: A built-in feature of Hudi's Merge-on-Read (MOR) tables that merges updates from the row-based log files into the corresponding columnar base files (see the configuration sketch after this list).
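The sketch below shows how these features might be wired together in a Flink job writing to a Hudi MOR table. The table name, record-key and field names, and the S3 path are hypothetical, and the exact connector option keys can vary between Hudi versions, so treat this as an illustration of the approach rather than the exact pipeline.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HudiUpsertJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical Hudi sink: a Merge-on-Read table with upserts, partial updates,
        // and async compaction. Option keys follow the Hudi Flink connector and may
        // differ slightly depending on the Hudi version in use.
        tEnv.executeSql(
            "CREATE TABLE customer_profile (" +
            "  customer_id STRING PRIMARY KEY NOT ENFORCED," +
            "  email STRING," +
            "  last_login TIMESTAMP(3)," +
            "  ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 's3://my-lakehouse/customer_profile'," +     // hypothetical bucket/path
            "  'table.type' = 'MERGE_ON_READ'," +                    // row-based logs + columnar base files
            "  'write.operation' = 'upsert'," +                      // incremental updates, no duplicate rows
            "  'precombine.field' = 'ts'," +                         // latest record wins on key collisions
            "  'payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload'," + // partial field updates
            "  'compaction.async.enabled' = 'true'," +               // merge log files into base files in the background
            "  'compaction.delta_commits' = '5'" +                   // trigger compaction every 5 delta commits
            ")");

        // Hypothetical upstream source (e.g. a CDC stream) registered elsewhere as `customer_changes`;
        // the INSERT performs the upsert into the Hudi table.
        tEnv.executeSql("INSERT INTO customer_profile SELECT * FROM customer_changes");
    }
}
```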
At the center of this solution is the data lakehouse architecture, a model that combines the scalability of data lakes with the management features and performance of traditional data warehouses at a reduced cost.
💸 Cost Overview
Architecture 1: RDS PostgreSQL with Microservices in Java
- RDS PostgreSQL (db.m4.large, Multi-AZ, 30GB Storage): $589.04/month