As the volume of data generated by new digital-era platforms keeps growing exponentially, and the insights derived from that data become more and more valuable and critical, a wave of architectural innovation and reform is under way. In this article we will take a close look at one of the most disruptive architectures in the data analytics space.
Let’s have a look at a traditional data analytics platform

Image 1 : Data analytics platform powered by data lake and data warehouse
The core components in the above architecture are data ingestion, a data lake, ETL and a data warehouse. A data lake and a data warehouse are similar and different at the same time. The similarity is that both are storage platforms for keeping data. Now let’s look at the differences.

Table 1 : Data lake vs Data warehouse
The million-dollar question — Is there a way to combine data lake and data warehouse?
Obviously, someone would ask: “Can we have a platform which can serve both as a data lake and a data warehouse?” The answer is the “data lakehouse”.
The data lakehouse is a recently emerged architecture that combines the best of both the data lake and the data warehouse. In a nutshell, a data lakehouse enables:
- ACID compliance and full transactional update capabilities on data lake
- High performance query execution
- Data science and data engineering workloads
- Unified real-time and batch data processing
- Schema enforcement and data governance

Image 2 : Simplified view of data lakehouse
As you can see in the diagram, the lakehouse storage acts as both data lake and data warehouse. This is a very simplified version of the architecture. We will now look at a few of the innovations and software products that made this architecture possible. They are quite interesting.
Some popular early players (they are still very popular)
Apache Presto: An open-source, distributed SQL query engine that runs on a cluster of machines. Originally developed at Facebook, Presto is now one of the most popular query engines available. It enables analytics using ANSI SQL on large amounts of data stored in a variety of systems such as HDFS, cloud storage, NoSQL stores etc. Presto provides data-warehouse-like query performance for BI and reporting tools.
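To make that concrete, here is a minimal sketch of querying Presto from Python using the presto-python-client package (`prestodb`); the coordinator host, catalog, schema and table names are hypothetical placeholders, not anything prescribed by Presto itself.

```python
import prestodb

# Connect to a (hypothetical) Presto coordinator exposing a Hive catalog.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="analytics",
)

cur = conn.cursor()
# ANSI SQL directly over files sitting in the data lake, no warehouse load step required.
cur.execute(
    "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10"
)
for page, hits in cur.fetchall():
    print(page, hits)
```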
Apache Drill: Similar to Presto, Drill is another open-source SQL query engine for Big Data exploration. Drill is designed from the ground up to support high-performance analysis of the semi-structured and rapidly evolving data coming from modern Big Data applications, using ANSI SQL. Drill originated at MapR (since acquired by Hewlett Packard Enterprise) and is based on Google’s Dremel.
Drill is an Apache top-level project, while Presto is governed by the Presto Foundation under the Linux Foundation.
Amazon Athena: This is an AWS-native, serverless version of Presto. It provides the functionality of Presto without any infrastructure to install or manage; you pay only for the queries you run, based on the amount of data scanned. Although Athena is primarily meant to query data stored in an AWS S3 data lake, federated queries along with source connectors can be used to query a large variety of other data sources.
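As a sketch of how that looks in practice, the boto3 snippet below submits a query to Athena and polls for the result; the database, table and S3 results bucket are made-up names used only for illustration.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; Athena writes result files to the S3 location below.
submitted = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page LIMIT 10",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = submitted["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```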
Amazon Redshift Spectrum: Another AWS offering, built on Amazon Redshift, the popular data warehouse product from Amazon. Redshift Spectrum leverages the parallel processing capabilities of an already provisioned Redshift cluster to connect to and query data stored in Amazon S3.
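A rough sketch of that workflow is shown below, assuming the redshift_connector driver and a Glue Data Catalog database; the cluster endpoint, IAM role and table names are placeholders invented for illustration.

```python
import redshift_connector

# Connect to an existing (placeholder) Redshift cluster.
conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()

# Register an external schema backed by the Glue Data Catalog so that
# S3-resident tables become queryable next to local Redshift tables.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'analytics_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
""")

# Join lake data (spectrum.web_logs) with a warehouse dimension table.
cur.execute("""
    SELECT w.page, COUNT(*) AS hits
    FROM spectrum.web_logs w
    JOIN dim_pages d ON d.page = w.page
    GROUP BY w.page
""")
print(cur.fetchall())
```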
Google BigQuery: Based on Google’s Dremel, this is the data warehouse offering from Google Cloud. Unlike a typical data warehouse solution, BigQuery also supports machine learning workloads using BigQuery ML, and it can connect to external big data storage systems to query data using federated queries.
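For illustration, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset and table names are hypothetical.

```python
from google.cloud import bigquery

# Assumes application-default credentials; project and table names are placeholders.
client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT page, COUNT(*) AS hits
    FROM `my-analytics-project.web.events`
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.page, row.hits)

# BigQuery ML lets you train a model with plain SQL, for example:
# CREATE OR REPLACE MODEL `web.churn_model`
# OPTIONS (model_type = 'logistic_reg') AS
# SELECT * FROM `my-analytics-project.web.training_data`;
```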
As you can see, there are a variety of products available. But none of them truly possess all the capabilities of a data lake house.
ACID transaction capabilities range from limited to nonexistent. Most of these engines work on immutable storage, so updating records is impossible. It is also difficult to achieve a unified real-time and batch data layer. Moreover, we still need different storage solutions to meet all our requirements, which deviates from the concept of a single data store acting as both data lake and data warehouse.
Delta Lake, Apache Iceberg and Apache Hudi
These platforms implement the data lakehouse using a metadata layer based on open table formats on top of data lake storage solutions such as HDFS, AWS S3, Azure Blob Storage, Google Cloud Storage etc.
Delta Lake: Delta Lake was created by Databricks and later open sourced. It is based on the popular Apache Parquet file format and uses a transaction log written as JSON files to support ACID transactions on the data lake. The Delta format is supported by many data processing frameworks, which can leverage it to enable data warehouse capabilities on existing data lake storage systems. Other key features include:
- Scalable metadata handling across billions of partitions and files
- Time travel to older versions of the data for audit or rollback
- Unified batch and streaming processing with transactional guarantees
- Schema evolution and enforcement to prevent bad data
- Audit history via the transaction log for a full audit trail
- DML operations: SQL, Scala/Java and Python APIs to merge, update and delete datasets
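To ground these features, here is a minimal PySpark sketch, assuming the delta-spark package is installed and using made-up paths and column names; it writes a Delta table, performs an ACID merge (upsert) and reads an older version back via time travel.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Spark session with Delta Lake enabled (assumes the delta-spark package is available).
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://my-lake/events"  # hypothetical data lake location

# Initial load written as a Delta table (Parquet data files + JSON transaction log).
spark.range(0, 5).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("overwrite").save(path)

# ACID upsert: merge a batch of changes into the existing table.
updates = spark.range(3, 8).withColumnRenamed("id", "event_id")
(DeltaTable.forPath(spark, path)
    .alias("t")
    .merge(updates.alias("u"), "t.event_id = u.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it was at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```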
Apache Iceberg: Another open-source project, initially developed at Netflix and now an Apache top-level project. Apache Iceberg is an open table format designed for huge, petabyte-scale tables. The job of a table format is to determine how you manage, organize and track all of the files that make up a table. Iceberg supports multiple file formats such as Apache Parquet, Apache Avro and Apache ORC. Using a combination of metadata, manifest and data files, it supports ACID transactions on data lake storage. Other capabilities include:
- Full schema evolution to track changes to a table over time
- Time travel to query historical data and verify changes between updates
- Partition layout evolution, enabling updates to partition schemes as queries and data volumes change, without relying on hidden partitions or physical directories
- Rollback to prior versions to quickly correct issues and return tables to a known good state
- Advanced planning and filtering capabilities for high performance on large data volumes
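Below is a minimal PySpark sketch of these ideas, assuming the iceberg-spark-runtime jar and the Iceberg SQL extensions are on the classpath; the catalog name, warehouse path and table are placeholders.

```python
from pyspark.sql import SparkSession

# Spark session with a Hadoop-type Iceberg catalog (assumes the Iceberg runtime jar is available).
spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://my-lake/warehouse")  # hypothetical location
    .getOrCreate()
)

# Create an Iceberg table and insert a few rows; both are ACID operations.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (event_id BIGINT, page STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'home'), (2, 'checkout')")

# Evolve the schema and the partition layout without rewriting existing data files.
spark.sql("ALTER TABLE local.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE local.db.events ADD PARTITION FIELD country")

# Time travel: list the table's snapshots, then query a specific one by its id.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
# spark.sql("SELECT * FROM local.db.events VERSION AS OF <snapshot_id>").show()
```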
Apache Hudi: An open-source project initially developed at Uber. Hudi (Hadoop Upserts Deletes and Incrementals) started as a streaming data lake platform on top of Apache Hadoop and has since evolved to support incremental batch jobs on any cloud storage platform. Hudi uses Apache Parquet and Apache Avro and enables ACID transaction capabilities. Other features include:
- Upserts and deletes with fast, pluggable indexing
- Transactions, rollbacks and concurrency control
- Automatic file sizing, data clustering, compaction and cleaning
- Built-in metadata tracking for scalable storage access
- Incremental queries and record-level change streams
- Backwards-compatible schema evolution and enforcement
- SQL reads/writes from data processing frameworks such as Spark, Presto, Trino, Hive and more
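The following PySpark sketch is a rough illustration of a Hudi upsert, assuming the hudi-spark-bundle jar is available; the table name, record key, precombine field, partition field and storage path are placeholder choices.

```python
from pyspark.sql import SparkSession

# Hudi recommends the Kryo serializer; assumes the hudi-spark-bundle jar is on the classpath.
spark = (
    SparkSession.builder.appName("hudi-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",      # unique key per record
    "hoodie.datasource.write.partitionpath.field": "page",      # partition column
    "hoodie.datasource.write.precombine.field": "ts",           # newest ts wins on conflict
    "hoodie.datasource.write.operation": "upsert",
}

df = spark.createDataFrame(
    [(1, "home", 1700000000), (2, "checkout", 1700000001)],
    ["event_id", "page", "ts"],
)

# Upsert into a Hudi table on the data lake; re-running with changed rows updates them in place.
df.write.format("hudi").options(**hudi_options).mode("append").save("s3a://my-lake/hudi/events")

# Read it back as a snapshot query.
spark.read.format("hudi").load("s3a://my-lake/hudi/events").show()
```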
Products for Enterprise
Now let’s look at some enterprise offerings:
Delta lakehouse by Databricks: Based on Delta Lake and developed by Databricks, this platform runs on the popular cloud providers. The entire platform is powered by Apache Spark, currently the most popular data processing framework on the market. By combining the capabilities of Delta Lake and Apache Spark, the Delta lakehouse provides capabilities such as:
- Lightning-fast performance for data processing with auto scaling and indexing
- Data science and machine learning workloads at scale using MLflow (a minimal tracking sketch follows the diagram below)
- Databricks SQL, a serverless SQL query execution engine built for ultra-fast BI and dashboarding requirements
- Unity Catalog, a unified data catalog for all data in the Delta lakehouse which also provides data governance and security capabilities
- A built-in dashboarding platform to create reports and visualizations

Image 3 : Delta lakehouse by Databricks
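As a taste of the MLflow-based ML workflow mentioned in the list above, here is a minimal experiment-tracking sketch; the experiment name, parameter and metric values are placeholders, not anything specific to Databricks.

```python
import mlflow

# Track a training run: parameters and metrics land in the MLflow tracking server
# (hosted automatically on Databricks, or a local ./mlruns directory elsewhere).
mlflow.set_experiment("/Shared/churn-model")  # placeholder experiment name

with mlflow.start_run():
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("auc", 0.91)
```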
Dremio: A product offering based on Apache Iceberg. Unlike Databricks, Dremio can run both in the cloud and on premises. Unlike the Databricks platform, Dremio does not provide built-in machine learning or data science capabilities, as its focus is mainly on data engineering, data warehousing and fast query performance. With a high-performance SQL query engine and data transfer capabilities, Dremio solves the problem of combining the data lake and the data warehouse. The two services powering the Dremio platform are:
- Dremio Sonar — A lakehouse engine built for SQL
- Dremio Arctic — A metadata and data management service for Apache Iceberg that provides a unique Git-like experience for the lakehouse
Final thoughts
I have done my best to capture the relevant details on each of these platforms. The ecosystem keeps growing, powered by more and more contributions from the open-source community. The beauty of these open-source platforms is that you can get all the details of their internals, and even better, you can check out the source code and start contributing. There is no limit to learning.
References
ACID transactions
Apache Presto
Apache Drill
Amazon Athena
Amazon Redshift Spectrum
Google BigQuery
Delta Lake
Apache Iceberg
Apache Hudi
Databricks lakehouse
Dremio