Delta Lake: My Learnings

Introduction 

When I mention Delta Lake, you might picture a serene body of water. Nope—think again. Delta Lake isn’t about nature; it’s about taming the wild chaos of data.

Let’s rewind. Why do data lakes even exist? Data’s a mess—unruly, unpredictable, and rarely fits neatly into boxes. Relational (SQL) databases tried to keep up, but they buckled under the sheer volume of modern data. Data warehouses stepped in, promising order, but they were sluggish and expensive. Then came data lakes with a bold pitch: store everything—cheaply—and process it on demand with powerhouse tools.

So, how do we wrangle it in data lakes? Columnar storage is the MVP here. Think file formats like Apache Parquet or Optimized Row Columnar (ORC)—efficient, scalable, and ready to tackle the madness.

But here’s the catch—those formats weren’t perfect. Far from it. Sure, Apache Parquet and ORC brought efficiency to the table, but they came with baggage.

1. There were no ACID transactions. What does this mean?

    • ACID stands for Atomicity, Consistency, Isolation, and Durability. Simply put, we want operations on our data to meet these criteria so that writes and deletes are safe and the risk of corrupting data is minimized.

2. There were major observability and data management issues: it was hard to tell what had changed in a table, when it changed, or whether a write had fully succeeded.

3. Downstream failures from lack of schema enforcement. 

    • Imagine this. An Apache Parquet “table” is really just a big box of files. Back then, you could put literally whatever you wanted (mini-boxes, per se) inside that box. With no rule enforcement, you get this concept known as schema drift - where the structure of our data can change over some time horizon without us even knowing. The little sketch below shows how easily that happens.
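
To make that concrete, here is a minimal sketch of schema drift with plain Parquet (no Delta Lake yet). The path, columns, and values are made up for illustration; all it assumes is a local PySpark install.

```python
# A minimal sketch of schema drift with plain Parquet (no Delta Lake yet).
# /tmp/sales_parquet and the columns are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-drift-demo").getOrCreate()
path = "/tmp/sales_parquet"

# Day 1: sales records with (order_id, amount).
day1 = spark.createDataFrame([(1, 9.99), (2, 4.50)], ["order_id", "amount"])
day1.write.mode("overwrite").parquet(path)

# Day 2: someone appends files with an extra column. Plain Parquet accepts
# this without complaint, because nothing enforces a table-level schema.
day2 = spark.createDataFrame([(3, 1.25, "web")], ["order_id", "amount", "channel"])
day2.write.mode("append").parquet(path)

# Depending on which files Spark samples, readers may or may not see the
# new column. The structure drifted and nothing ever failed loudly.
spark.read.parquet(path).printSchema()
```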

This is where Databricks’ brilliant invention comes in: Delta Lake, open-sourced in 2019.

The Delta Lake Storage Layer

The idea is quite simple. Delta Lake is a storage layer built on top of cloud object storage (like Amazon S3) that basically solves all the problems we just discussed. We get transaction logging for observability, ACID transactions, schema enforcement, and even cool features like time travel through your data.
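
Here is a minimal sketch of two of those features in action: schema enforcement and time travel. It assumes the open-source delta-spark package is installed locally; the table path and data are made up.

```python
# A minimal sketch of schema enforcement and time travel with the
# open-source delta-spark package; the path and data are made up.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-features-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/events_delta"
v0 = spark.createDataFrame([(1, "click"), (2, "view")], ["event_id", "action"])
v0.write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending a mismatched schema is rejected up front
# instead of silently drifting like the plain-Parquet example earlier.
bad = spark.createDataFrame([(3, "click", "web")], ["event_id", "action", "channel"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as err:
    print("append rejected:", type(err).__name__)

# Time travel: read the table as it looked at an earlier version.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```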

You can think of the storage layer as a software layer that defines a format and a set of protocols for managing tabular data (data organized into rows and columns) plus its metadata within an S3 bucket. Bear in mind that it can be any object store, such as Azure Blob Storage for example. What is even more fascinating is that this layer is actually built on top of Apache Parquet. So in a sense we never stepped away from Parquet... Databricks simply made it better.
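
If you peek inside a Delta table's directory, that claim becomes obvious: the data files are still ordinary Parquet, and the only addition is a _delta_log folder. A quick, hedged look, assuming a table exists at the made-up path /tmp/events_delta (the sketch above writes one):

```python
# A quick look at what is actually sitting on disk (or in your bucket) for
# the made-up Delta table /tmp/events_delta written in the sketch above.
import os

for root, _, files in os.walk("/tmp/events_delta"):
    for name in sorted(files):
        print(os.path.join(root, name))

# Roughly the shape you should expect:
#   /tmp/events_delta/part-00000-<uuid>.snappy.parquet       <- data: still plain Parquet
#   /tmp/events_delta/part-00001-<uuid>.snappy.parquet
#   /tmp/events_delta/_delta_log/00000000000000000000.json   <- the transaction log
#   /tmp/events_delta/_delta_log/00000000000000000001.json
```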

Storage Layer Breakdown

When we think of this software layer, it can sometimes be hard to visualize mentally, so let’s break it down a bit more:

1. Backend: a cloud object store holding Parquet files and transaction logs. You have the flexibility to choose the backend storage system, since it’s an open-source storage layer that sits on top of your data lake rather than being tied to a specific storage provider. S3, Azure, the world is your oyster. This is why you'll see a lot of companies build their own unique solutions.

2. Files: Parquet format. What's brilliant about Parquet is how incredibly efficient it is as a storage format. There are a lot of reasons, the main ones being that Parquet uses columnar storage and does advanced data compression. Now don't panic! Columnar storage sounds fancy, but really it just means data is stored column by column, which enables cool things like better compression (because values in the same column usually share a data type) and more efficient disk reads (I don't have to read a bunch of data within a row to get to a particular column, I can just directly grab that column). Columnar storage sacrifices row-level efficiency for column-level efficiency. It’s optimized for queries like "average age across all rows" (one column read) rather than "everything about John Doe" (all columns for one row). There's a small sketch of this trade-off after the breakdown.

3. Transaction logs: this is pretty simple, to be honest. Each commit to the table is written as a JSON log entry, and Delta periodically checkpoints the log into Parquet so it can be replayed quickly. Databricks actually calls it the DeltaLog (https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html), and there's a peek at one of these commit files after the breakdown.

4. Now the actual computation and querying layer, which is handled through Apache Spark. When I say computation and querying layer, simply imagine that any code interacting with the tables (reading, writing, updating, etc...) is handled by this layer. It is important to note that Apache Spark is NOT a programming language. It is simply an analytics ENGINE that performs complex computations efficiently. The newer Delta Kernel project makes it easier for other engines like Trino, Flink, Presto, Hive, and DuckDB to work with Delta tables, but Spark remains the main and default supported engine. Let's dig a little bit into engines to understand them further and solidify this idea.

For example... imagine I have a chicken shack and I am selling 10 chickens a day. I start to scale, and now I have millions of chickens being sold with millions of transaction logs. I first need to figure out a cheap way to store them... so I decide to build myself an organized library with a unique book placement and checkout system. This we can call the Delta Lake. But now I want to analyze my transaction logs to improve the efficiency of my business. I could go through every log by hand (aka look at every book in the library), but that wouldn't really make sense. Instead, I hire a super-robot to go through every book in my library, analyze it, organize it, and give me answers to the questions I need. At its core, think of Apache Spark as the super-robot brain that organizes my library (the Delta Lake).

5. Now this hypothetical "super-robot" (like real-world robots) won't really perform any tasks until we give it the proper commands/operations. Think of Apache Spark as your super-robot, awaiting commands and operations from the user/developer (you) to act on. There is a wide range of ways to define these operations, and this list is not exhaustive. The Delta Lake API is a big one... it allows us to programmatically perform advanced operations that you'd typically do in SQL, like MERGE. Starting to see the power of this? There is also something called Spark SQL... so you can actually just use plain SQL commands to manipulate your tables if you don't want to write Python code for your manipulations. And there is the Apache Spark DataFrame API... which really gives you that Pandas DataFrame feel while interacting with Delta tables. Brilliant! The last sketch below shows all three side by side.
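
First, the columnar trade-off from point 2. This tiny sketch assumes a Parquet dataset at the made-up path /tmp/people_parquet with name and age columns:

```python
# A tiny sketch of the columnar trade-off, assuming a Parquet dataset at
# the made-up path /tmp/people_parquet with `name` and `age` columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("columnar-demo").getOrCreate()
people = spark.read.parquet("/tmp/people_parquet")

# "Average age across all rows": only the `age` column chunks need to be
# read from disk, which is where columnar storage shines.
people.select(F.avg("age")).show()

# "Everything about John Doe": every column has to be fetched for the
# matching row, the access pattern columnar formats trade away.
people.filter(people.name == "John Doe").show()
```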
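Next, the DeltaLog from point 3. Each commit is a newline-delimited JSON file of actions; this hedged peek assumes a Delta table already exists at the made-up path /tmp/events_delta:

```python
# A peek at the DeltaLog for the made-up table /tmp/events_delta. Each
# commit file is newline-delimited JSON: one action per line, and Delta
# periodically checkpoints these into Parquet.
import json
from pathlib import Path

first_commit = Path("/tmp/events_delta/_delta_log/00000000000000000000.json")

for line in first_commit.read_text().splitlines():
    action = json.loads(line)
    # Typical action keys: commitInfo, protocol, metaData, add, remove.
    print(list(action.keys()))
```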
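And finally, the three ways of commanding the super-robot from point 5: the Delta Lake API, Spark SQL, and the DataFrame API. This is a sketch under assumptions, not a production recipe; it assumes delta-spark is installed and uses made-up paths and columns.

```python
# A sketch of commanding the "super-robot" three ways: the Delta Lake API,
# Spark SQL, and the DataFrame API. Assumes delta-spark is installed; the
# table path, columns, and emails are all made up.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

builder = (
    SparkSession.builder.appName("delta-ops-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/customers_delta"
seed = spark.createDataFrame([(1, "alice@old.example")], ["customer_id", "email"])
seed.write.format("delta").mode("overwrite").save(path)

updates = spark.createDataFrame(
    [(1, "alice@new.example"), (42, "bob@example.com")],
    ["customer_id", "email"],
)

# 1) Delta Lake API: an upsert (MERGE) without hand-writing SQL.
target = DeltaTable.forPath(spark, path)
(target.alias("t")
       .merge(updates.alias("u"), "t.customer_id = u.customer_id")
       .whenMatchedUpdate(set={"email": "u.email"})
       .whenNotMatchedInsertAll()
       .execute())

# 2) Spark SQL: plain SQL over the same table.
spark.sql(f"SELECT COUNT(*) AS customers FROM delta.`{path}`").show()

# 3) DataFrame API: that Pandas DataFrame feel.
df = spark.read.format("delta").load(path)
df.filter(df.customer_id == 42).show()
```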

Now what is super cool is that this abstraction has changed the way corporations and developers interact with data. It has allowed companies like Databricks to build unique feature sets for their customers to adopt in a strongly data-driven world. Let me give you an example. What if I want to expose a subset of my historical data to a third party in a governed way? No problem. I can use the Delta Sharing feature to expose Delta tables via a RESTful protocol. Brilliant, right?
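
From the consumer's side, here is a hedged sketch of what that looks like, assuming the open-source delta-sharing Python connector is installed and the provider has handed you a profile file ("config.share" is a made-up filename, and the share/schema/table names are placeholders):

```python
# A hedged sketch of the consumer side of Delta Sharing, assuming the
# open-source delta-sharing connector is installed and the provider gave
# you a profile file ("config.share" is a made-up filename).
import delta_sharing

profile = "config.share"

# See which tables have been shared with you...
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# ...and pull one of them straight into pandas over the REST protocol.
# The share/schema/table names here are placeholders.
table_url = profile + "#my_share.my_schema.my_table"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```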

Limitations of the Delta Lake Approach

Now, in the Delta Lake paper... the authors do acknowledge that there are some limitations to this storage layer approach:

1. Serializable transactions are limited to a single table. Serializable simply means that even when transactions run at the same time, the end result is the same as if they had been played back one at a time, in some sequential order.

2. For streaming workloads, Delta Lake's performance is limited by the latency of the underlying cloud object store. Now to be fair, this isn't really that big of a bottleneck for most pipelines, since cloud object stores like Amazon S3 are fast and highly available; it mainly matters if you need very low-latency streaming.

3. Originally, Delta Lake did not support secondary indexes beyond the min-max statistics kept for each data object, though at the time of publication something called a Bloom filter-based index was being prototyped. Think of min-max stats as a simple way of recording the smallest and biggest values in each data file. A Bloom filter is a much fancier tool that can tell you a specific value is definitely not in a file (with a small chance of false positives), instead of only looking at ranges. This matters a lot because when we are working with huge piles of data, these indexes tell Delta Lake which files it can skip entirely. That helps a lot in messy data lakes with millions of files, and it is a path to achieving much more real-time queries by ignoring irrelevant data. There's a tiny sketch of file skipping below.
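
To make file skipping concrete, here is an illustrative sketch (plain Python, not Delta's actual code) of how min-max statistics let an engine avoid opening most files; the file names and stats are invented:

```python
# An illustrative sketch (plain Python, not Delta's actual code) of min-max
# file skipping: with per-file column statistics, the engine only opens
# files whose range could possibly contain the value being filtered on.
file_stats = [
    {"path": "part-000.parquet", "min_id": 1,   "max_id": 500},
    {"path": "part-001.parquet", "min_id": 501, "max_id": 900},
    {"path": "part-002.parquet", "min_id": 901, "max_id": 1400},
]

def files_to_scan(target, stats):
    # Any file whose [min, max] range excludes the target is skipped
    # without ever being fetched from object storage.
    return [f["path"] for f in stats if f["min_id"] <= target <= f["max_id"]]

print(files_to_scan(650, file_stats))  # ['part-001.parquet'], two files skipped
```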

I don't know what is next in this world of storage layers and file formats, but I've heard that Apache Iceberg is becoming a direct competitor to the Delta Lake format. I might write about it some other time, but what I do know is that it was born at Netflix and has much less vendor lock-in, which is a common complaint about Delta Lake's tight integration with Spark.