Rust and Crates

Introduction

Rust is a brilliant programming language that has taken the development world by storm. It combines performance, memory safety and concurrency, making it a strong fit for a wide variety of use cases.

Rust falls into a category we call "multi-paradigm" programming languages. It borrows some functional programming concepts, but its design centers on memory safety and performance, leaning specifically toward the world of systems programming and concurrency.

There are a couple of key conceptual pieces that I have learned about Rust that I'd like to share. Let's start.

The Rust Compiler (rustc)

The Rust compiler today (rustc) is known as a self-hosting compiler. What this essentially means is that the compiler is written in the very language it compiles: Rust itself. Interestingly enough, the first Rust compiler was actually built in OCaml, one of the most powerful functional programming languages in the world.

LLVM, which originally stood for "Low-Level Virtual Machine", is the core backend of the Rust compiler. There are two steps involved:

  1. rustc lowers your code through its own intermediate representations and ultimately emits LLVM IR. We primarily care about this because there is such a diverse range of target architectures (x86, ARM, etc.) that generating machine code separately for every specific processor that exists, or comes out in the future, would be a far more complicated implementation.
  2. This intermediate representation is then passed on to LLVM, which converts it into machine code: basically sequences of binary instructions for the target processor.
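
If you want to see the hand-off for yourself, rustc can dump the LLVM IR it generates (the file name here is just an example):

    rustc --emit=llvm-ir main.rs   # writes main.ll, the LLVM IR for this program
    rustc main.rs                  # full pipeline: LLVM IR -> machine code -> executable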

Interestingly enough, LLVM is written in C++, and is an extremely mature piece of software that will remain foundational to a lot of programming languages for years to come. It actually came out of a research project by Chris Lattner at UIUC. If you aren't familiar with Chris Lattner, take a look at his recent company Modular AI. They are essentially building a Python-style language (Mojo) that aims for CUDA-level performance for GPU programming. Cool right?

Cargo

Simply put, cargo is a package manager and build tool for Rust. It helps us build, run, test, and manage dependencies super easily. Cargo essentially manages crates, and can fetch them and ensure they are properly compiled and linked. 
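
To make that concrete, here is the day-to-day workflow (the project and dependency names are just examples):

    cargo new hello_rust    # create a new binary crate with a Cargo.toml and src/main.rs
    cargo build             # compile the crate and all of its dependencies
    cargo run               # build if needed, then run the resulting binary
    cargo test              # compile and run the test suite
    cargo add serde         # add a dependency from crates.io to Cargo.toml (serde is just an example)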

Crates 

Rust lives and breathes off open source. There is a massive crate registry here called crates.io. You're probably wondering, what are crates?

They are simply an organizational unit. They can represent a single executable in your project (such as the main file), or they can represent an entire library/project allowing us to pull reusable functions, types, etc. One crate can use another crate - which is a foundational concept to modularity in Rust.
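
As a small sketch of that modularity, a binary crate can pull in a library crate from crates.io just by declaring it in Cargo.toml (rand is only an example dependency here):

    # Cargo.toml
    [dependencies]
    rand = "0.8"

    // src/main.rs - our binary crate calling into the rand library crate
    use rand::Rng;

    fn main() {
        // gen_range comes from the Rng trait defined in the rand crate
        let n: u32 = rand::thread_rng().gen_range(1..=100);
        println!("random number: {}", n);
    }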

Compile Time Enforcement

The way Rust makes guarantees to the developer and the system is by using the compiler to enforce rules. This enforcement is driven by ownership, borrowing and lifetime principles. The part of the compiler that enforces these rules is the Borrow Checker. Let's expand on the three below.

Ownership

Every single piece of data in Rust has a single owner. If the owner goes out of scope, the data is deallocated. For example:
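
Here is a minimal sketch of that scope (the variable name is just for illustration):

    fn main() {
        {
            // `name` owns the heap-allocated String
            let name = String::from("My Name is Omeed");
            println!("{}", name);
        } // scope ends: `name` is dropped and its heap memory is freed

        // println!("{}", name); // would not compile: `name` no longer exists
    }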

This string will live and die within the boundaries of the brackets. When the program reaches the closing brace, the scope has ended, and Rust calls the drop function and deallocates the heap memory that was backing the string "My Name is Omeed".

Now you can also transfer ownership, which looks like variable reassignment, but in the process the original variable is invalidated. For example, if I have a variable x that owns a String and I set y = x, the String moves to y and x can no longer be used. (Simple types like integers are an exception: they implement Copy, so assignment duplicates them instead of moving them.) This is a memory safety feature that avoids accessing freed memory or freeing the same memory multiple times.
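
A minimal sketch of a move:

    fn main() {
        let x = String::from("hello");
        let y = x; // ownership of the heap data moves from x to y

        // println!("{}", x); // would not compile: value was moved out of x
        println!("{}", y);

        // Copy types behave differently: both a and b stay valid
        let a = 5;
        let b = a;
        println!("{} {}", a, b);
    }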

Borrowing

Let's say we don't want to take ownership at all. We can borrow a variable instead, using references. In Rust, the reference symbol is &. It is easy to confuse this with the address-of operator in C, or with C++ references, which also use & in their declarations but are essentially unchecked aliases to existing variables. Rust references are borrowed pointers that come in immutable and mutable flavors and are checked by the compiler. Ok, I know that was a lot... but let's break down borrowing a bit more now.

In Rust borrowing, you can have either one mutable reference or unlimited immutable references. Here's the catch: they cannot coexist. Let's say I have a string x. I can do 20 immutable borrows if I want using the & operator. But... if I also try to take a mutable borrow of x using &mut while those immutable borrows are still in use, the compiler will enforce its rule and compilation will fail.
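
A minimal sketch of the rule in action:

    fn main() {
        let mut x = String::from("hello");

        let r1 = &x; // immutable borrow
        let r2 = &x; // any number of immutable borrows is fine
        println!("{} {}", r1, r2);

        // the immutable borrows above are no longer used, so a mutable borrow is allowed now
        let m = &mut x;
        m.push_str(" world");
        println!("{}", m);

        // let r3 = &x;
        // let m2 = &mut x;
        // println!("{} {}", r3, m2); // would not compile: immutable and mutable borrows overlap
    }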

This might not seem shocking, but this is the CORE reason that data races do not happen in safe Rust. In a concurrent system, if shared (read) access and mutable (write) access to an object can overlap, one thread could read a variable while another thread in your system is modifying it, which is not good... the first thread could observe a half-written, garbage value because the second thread overwrote it in the middle of the read.

You're probably thinking... doesn't this make the language feel a bit "limiting" compared to the freewheeling pointer-and-reference model of C++? True. You aren't wrong. But this guarantees correctness and safety, which can be quite powerful in a "mission-critical" software engineering setting. Life is all about trade-offs, and this is one of them.

Additionally, there are escape hatches for when you do need shared mutable state: concurrency primitives like Mutex (usually wrapped in Arc for shared ownership across threads), ownership transfer, lifetimes (which I'll talk about in the next section), etc.
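
For instance, here is a minimal sketch of sharing a counter across threads with Arc and Mutex:

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        // Arc gives shared ownership across threads; Mutex serializes mutable access
        let counter = Arc::new(Mutex::new(0));
        let mut handles = Vec::new();

        for _ in 0..4 {
            let counter = Arc::clone(&counter);
            handles.push(thread::spawn(move || {
                *counter.lock().unwrap() += 1;
            }));
        }

        for handle in handles {
            handle.join().unwrap();
        }

        println!("{}", *counter.lock().unwrap()); // prints 4
    }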

Lifetimes

In a lot of programming languages, like Java or Python, there is automatic garbage collection... which naturally comes with overhead. In C++, there is no automatic garbage collection; memory management is essentially up to the user - you are the garbage collector in a sense. That granular control can in turn lead to better performance and more predictability in your code. In Rust, neither approach is used. Rather, we rely on lifetimes paired with all the other concepts discussed above.

The way to think about lifetimes is that the compiler is essentially trying to prevent a reference from outliving the data it points to. If I have a variable x, I don't want references y and z to outlive x. If x is deallocated, I want y and z to be invalid as well. The reason we say invalid is that references don't actually own heap memory; their lifetimes are simply being enforced by the borrow checker.

Usually, when it comes to lifetimes, the borrow checker infers them from how references are used throughout your code. But in scenarios with more complexity, you can explicitly annotate references with lifetime parameters like 'a to tell the compiler how the lifetimes relate, so it can check them accurately.
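
Here is a minimal sketch of an explicit lifetime annotation (the function is just an illustration):

    // 'a says: the returned reference lives no longer than both inputs
    fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
        if a.len() > b.len() { a } else { b }
    }

    fn main() {
        let s1 = String::from("a fairly long string");
        let result;
        {
            let s2 = String::from("short");
            result = longest(s1.as_str(), s2.as_str());
            println!("{}", result); // fine: s2 is still alive here
        }
        // println!("{}", result); // would not compile: result may borrow from s2, which is gone
    }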

Final Thoughts 

Overall, I hope this post has been educational in learning some of the baseline concepts of Rust and what makes it so powerful.




The LLM Fear Mongering

I recently came across a quote by Yann LeCun that stuck with me: “Inventing new things requires a type of skill and ability that you are not going to get from LLMs.” It’s a simple statement, but it cuts through a lot of the noise we hear about artificial intelligence today. And I think it’s spot on.

I fundamentally believe that as long as this remains true, fields like computer science, engineering, law, the arts - and so many others - will stay not just relevant, but essential to pursue and study. Anyone claiming otherwise, telling you that AI will render human effort obsolete, is likely fear-mongering for their own gain. Don’t buy it.

Think about Moore’s Law for a second - the observation that computing power doubles roughly every couple of years. It’s not just a rule about silicon chips; to me, it reflects a deeper philosophy of progress. You could argue there’s a parallel law at play: human ingenuity scales alongside our tools. As technology accelerates, so does our capacity to learn, think critically, and invent - not because machines take over the heavy lifting, but because we wield them. The better our tools get, the more we’re challenged to step up, to master them, to push beyond what they can do alone.

That’s the beauty of it. Large language models like the ones powering chatbots or writing assistants can churn out text, analyze data, even mimic creativity to a point. But invention? True, original creation? That’s still ours. It’s the spark that comes from wrestling with a problem, from seeing connections where none existed before, from daring to fail and try again. No algorithm’s going to replicate that - not yet, and maybe not ever.

So here’s my take: we’re not in an era where AI replaces us. We’re in a golden age of learning. The tools we have today amplify what’s possible, but they don’t erase the need for human curiosity, grit, or imagination. If anything, they demand more of it. Study the fields that excite you. Build things. Ask questions. Stay curious, my friends - because that’s how we keep writing the story of progress. 

What is gRPC?

Background Context

We use protocols for communication between the client and the server.

Think about it like this: 

  • An API is a set of functions, endpoints and tools defining what data is moving and why
  • Protocols set rules for how the data is exchanged and moved.
  • Protocols enable APIs to work; they are the foundation that APIs are built on.
  • Finally, just understand that protocols operate at a lower level of abstraction than APIs. 

If you've ever created an API, chances are that you have used REST. REST stands for "Representational State Transfer" - an architectural style typically implemented over HTTP/1.1 and HTTP/2 using the classic methods like PUT, POST, GET and DELETE. RESTful APIs are super common in software engineering, and we typically send and receive payloads in JSON format. Super simple and easy to use.

gRPC

gRPC is a "newer" protocol, per se, that is implemented over HTTP/2. The name comes from Google and is usually expanded as "Google Remote Procedure Call". Instead of operating off an endpoint paradigm, it operates like function calls. Streaming is built natively into the protocol, so you don't need to bolt WebSockets onto your API to persist a connection. The data format is also protocol buffers, more popularly known as protobuf, which Google developed to serialize structured data. The protobuf compiler essentially generates code to serialize and deserialize your data, and the data travels as a compact binary format that improves performance when it is processed, moved over a network, or even stored. Serialization just means converting to a transmittable format.

The most common use cases for gRPC are microservice communication in distributed infrastructure and real-time data streaming. It is highly performant and interoperable thanks to the .proto file, which defines the services and messages that the protobuf compiler uses for code generation across a huge range of languages.
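
As a small, hypothetical illustration (the service and message names are made up), a .proto definition looks like this:

    syntax = "proto3";

    package greeter;

    service Greeter {
      // a plain unary call: one request in, one response out
      rpc SayHello (HelloRequest) returns (HelloReply);
      // a server-streaming call: the connection stays open and replies keep flowing
      rpc StreamHellos (HelloRequest) returns (stream HelloReply);
    }

    message HelloRequest {
      string name = 1;
    }

    message HelloReply {
      string message = 1;
    }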

One thing I did also observe is that gRPC does not list Rust as a supported language. But there is a Rust implementation of gRPC called tonic that has been growing in popularity (https://docs.rs/tonic/latest/tonic/).

My Thoughts

I'm beginning to realize that with the growing wave of AI, Moore's Law, and the continuous push by huge companies to invest in infrastructure, this protocol will become increasingly important for new software engineers to learn. Stay tuned, as I will be writing some code to learn gRPC more in depth. :)

The Dynamic TanH (DyT)

Introduction

A really interesting new paper by Yann LeCun, Jiachen Zhu, Xinlei Chen, Kaiming He, and Zhuang Liu (https://arxiv.org/abs/2503.10622) introduces a novel approach to replace normalization layers in Transformers with a simple operation called Dynamic Tanh (DyT) - challenging the need for normalization altogether.

Understanding Normalization

In the world of neural networks, reducing covariate shift is important. Covariate shift basically means that the data you trained your model on does not match the data it sees later when you use it. A common thing that happens in neural networks is internal covariate shift: as data passes through the layers of the network and weights are being tweaked, the distribution of the activations continuously changes.

This internal covariate shift is not tied to the data's diversity that you are training on, but is tied more to the architecture of the network and how things are behaving as we move through the layers. One of the most common ways that we reduce internal covariate shift is normalization, so the data's distribution is more consistent as we move through the network. Let's look at a brief example of normalization from my Decision Transformer project code: 
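
The original snippet isn't reproduced here, but a minimal sketch of the same idea in PyTorch (names and structure are illustrative, not the original project code) would look something like this:

    import torch.nn as nn

    class NormSwitch(nn.Module):
        """Pick a normalization flavor by name and apply it in the forward pass."""
        def __init__(self, kind: str, dim: int):
            super().__init__()
            if kind == "layer":
                self.norm = nn.LayerNorm(dim)
            elif kind == "batch":
                # note: BatchNorm1d normalizes over the batch and expects channels in dim 1
                self.norm = nn.BatchNorm1d(dim)
            else:  # "none": skip normalization entirely
                self.norm = nn.Identity()

        def forward(self, x):
            return self.norm(x)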

Think of this as a normalization switchboard. It allows us to do either layer normalization, batch normalization, or skip it altogether. After an option is selected, it applies the chosen normalization to the input value x (a tensor) during the forward pass. It is important to note that there are a bunch of flavors of normalization beyond this. For example, RMSNorm (root mean square normalization) is a simpler flavor of layer normalization. Instance normalization is another one, where we basically do batch-norm-style normalization but per instance. The reason this "switchboard" even existed in my team's code was that we were using the transformer for robotic simulation.

In modern transformer architectures like GPT, LLaMA, etc.. layer normalization is the most popular and used quite religiously. For each data point (like a token in a sequence), it computes the mean and variance across the features (e.g., the hidden dimensions) rather than across a batch of examples. Then it scales and shifts the result using learnable parameters. 

Think about it like this: if I have a batch with sentences of length 10 and a batch with sentences of length 50, I immediately run into problems using something like batch normalization with transformers... because the statistics now depend on how the batch happens to be shaped. Layer normalization works consistently regardless, since each token is normalized over its own features.

For example, let's say I have a sentence that is "Omeed likes to eat potatoes", and I am looking at the "Omeed" token. Hypothetically, let's apply the BERT embedding model to yield a 768 dimension vector representing my name. When I normalize with layer normalization, it'll only look at the embedding vector for my name and calculate the mean and variance for that to normalize - completely independent from the remaining tokens in my batch ("likes", "to", "eat", "potatoes", etc.). By normalizing per token, we essentially are preserving the individual characteristics of each word, which is important to natural language. 

To illustrate with a concrete example, let’s examine Andrej Karpathy’s implementation of LayerNorm. What’s brilliant about his approach is that he wrote it in Python, unlike the underlying PyTorch implementation, which relies on low-level languages for optimal performance. This choice makes it much easier to understand:
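
His exact code isn't reproduced here, but the heart of the idea (paraphrased as a sketch, not his verbatim implementation) is just a few tensor operations:

    import torch

    def layernorm_forward(x, weight, bias, eps=1e-5):
        # compute mean and variance over the feature (last) dimension of each token
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        # center and scale to zero mean / unit variance
        xhat = (x - mean) / torch.sqrt(var + eps)
        # then stretch and shift with the learnable weight and bias
        return weight * xhat + bias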

As you can see... we center the data, scale it, and adjust it with weights and biases. Think messy pile of data to a neat, organized and polished format. To learn more about this snippet, I highly recommend visiting this link.

Now all this normalization talk does come at a cost. During training time and inference time, there is computational overhead to normalization. With batch normalization, mean and variance are computed on the fly for every batch during training, so the compute cost during training is quite significant (at inference it falls back to stored running statistics). With layer normalization, the statistics also have to be computed at inference time, so depending on the input size there can be significant computational overhead. But to a lot of scientists and researchers, this is a small price to pay for training stability and model performance.

The DyT

Now imagine you can go into your code, replace all layer normalizations with DyT, and significantly cut latency and memory usage. Let's first just start with the math. It'll be easier to explain after you've seen the equation representing Dynamic TanH:
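
From the paper, the operation is:

    DyT(x) = γ * tanh(α * x) + β

where α is a learnable scalar, and γ and β are learnable per-channel vectors (more on each below).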


So let's start with replacement. In a typical transformer architecture, you will apply layer normalization before the multi-head attention block, before the feed-forward network, and before the final output projection. Using DyT is as simple as going into those parts of your architecture and replacing them. Now what is actually happening here? Well, you still have your input tensor x. This is multiplied by a learnable scalar α, which can stretch or compress your input values. Think of it as an automatic knob that is learned during training and optimized via backpropagation. It is adjusted automatically during gradient descent, a simple way to minimize error by taking tiny steps towards an optimum. We care about this because the paper shows that layer normalization in transformers ends up producing a tanh-like input-output mapping.
Meaning that if we optimize for a good tanh output - because it's S-shaped, bounded between -1 and 1, with smooth behavior - it will naturally replicate layer normalization's key effects. But scalars like α matter because the inputs to this function might have super large or super small values during training, making the mapping too flat or too saturated - so we are ideally looking for a sweet spot.

Without overcomplicating it, γ and β are learnable vectors... simply optimized during training via backpropagation. Putting it all together, the final code snippet is below.
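
Here is a sketch of that module in PyTorch, closely following the description below and the paper (an illustration, not necessarily the paper's verbatim listing):

    import torch
    import torch.nn as nn

    class DyT(nn.Module):
        def __init__(self, num_features, alpha_init=0.5):
            super().__init__()
            self.alpha = nn.Parameter(torch.ones(1) * alpha_init)   # learnable scalar, starts at 0.5
            self.weight = nn.Parameter(torch.ones(num_features))    # γ, starts as all ones
            self.bias = nn.Parameter(torch.zeros(num_features))     # β, starts as all zeros

        def forward(self, x):
            # DyT(x) = γ * tanh(α * x) + β
            return self.weight * torch.tanh(self.alpha * x) + self.bias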

The weight γ starts as an all-ones vector, the bias β starts as an all-zeros vector, and the alpha value (our scalar) starts at 0.5, a default suggested in the paper. In the forward pass, it is all simply put together, and it matches the formula we discussed above.

For years, we believed normalization was indispensable for stable neural network training, but this groundbreaking discovery of Dynamic Tanh has me captivated by the innovative ways researchers will now design efficient, high-performing models.

What are inverted indices?

Back when Laith and I were creating the Nera search engine, we dove into a world of technicality that neither of us had seen in our years of college. There were many backend pieces to the puzzle, and LLMs were definitely involved. We found that to build a great search engine, you usually want to choose technologies that complement LLMs; they are not a one-and-done solution on their own.

One of these puzzle pieces was the "inverted index". This is a data structure that maps each unique term—think words or phrases—to the specific documents (and often positions within those documents) where it appears. By organizing data this way, we could dramatically speed up full-text searches, making it possible to quickly retrieve relevant results from a massive collection of documents.

We started by scraping the internet, focusing primarily on local websites for cities like Waco to gather information about things to do. We collected this raw data—think event listings, attraction descriptions, or activity guides—and converted it into individual documents. Then, to prepare the text for indexing, we ran it through a stopword filter, stripping out common words like 'the,' 'and,' or 'is' that don’t carry much meaning for search purposes. 

Next, for each unique term left after filtering (say, 'museum' or 'hiking' from a Waco activities page), the inverted index records where that term appears across all our documents. It is implemented as a hash table, with the filtered terms as keys and, as values, lists containing the document IDs and positions within the documents. Here is a basic illustration below:
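
A minimal sketch in Python (the documents, terms, and stopword list are made up for illustration):

    # inverted_index maps each term to a list of (doc_id, position) pairs
    STOPWORDS = {"to", "in", "and", "the", "is"}

    documents = {
        1: "things to do in waco museum tours",
        2: "hiking trails and outdoor museum exhibits",
    }

    inverted_index = {}
    for doc_id, text in documents.items():
        for position, term in enumerate(text.split()):
            if term in STOPWORDS:
                continue  # drop stopwords, but keep the original positions
            inverted_index.setdefault(term, []).append((doc_id, position))

    print(inverted_index["museum"])  # [(1, 5), (2, 4)]
    print(inverted_index["hiking"])  # [(2, 0)]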

And that's it! Cool right?

Delta Lake: My Learnings

Introduction 

When I mention Delta Lake, you might picture a serene body of water. Nope—think again. Delta Lake isn’t about nature; it’s about taming the wild chaos of data.

Let’s rewind. Why do data lakes even exist? Data’s a mess—unruly, unpredictable, and it rarely fits neatly into boxes. Relational SQL databases tried to keep up, but they buckled under the sheer volume of modern data. Data warehouses stepped in, promising order, but they were sluggish and expensive. Then came data lakes with a bold pitch: store everything—cheaply—and process it on demand with powerhouse tools.

So, how do we wrangle it in data lakes? Columnar storage is the MVP here. Think file formats like Apache Parquet or Optimized Row Columnar (ORC)—efficient, scalable, and ready to tackle the madness.

But here’s the catch—those formats weren’t perfect. Far from it. Sure, Apache Parquet and ORC brought efficiency to the table, but they came with baggage.

1. There were no ACID transactions. What does this mean?

    • ACID stands for Atomicity, Consistency, Isolation, and Durability, and simply put, it means that we want operations with our data to meet these criteria so that we can have safety in writes and deletes with minimized corruption of data.

2. There were major observability and data management issues. 

3. Downstream failures from lack of schema enforcement. 

    • Imagine this. In the Apache Parquet file format, every file is a big box. Back then, you could put literally whatever you wanted (mini-boxes, so to speak) inside of each box. With no rule enforcement, this leads to a concept known as schema drift - where the structure of our data can change over some time horizon without us even knowing.

This is where Databricks' brilliant invention comes in, back in 2019: Delta Lake.

The Delta Lake Storage Layer

The idea is quite simple. Delta Lake is a storage layer that is built on top of cloud object storage (like Amazon S3), that basically solves all the problems we just discussed above. We have transaction logging for observability, ACID transactions, schema enforcement, and even cool features like time traveling in your data. 

You can think of the storage layer as a software layer that defines a format and some particular protocols for managing tabular data (data that is displayed in columns or tables) plus metadata within an S3 bucket. Bear in mind that it can be any object store, such as Azure Blob Storage for example. What is even more fascinating is that this layer is actually built on top of Apache Parquet. So in a sense we never stepped away from Parquet... Databricks simply made it better.

Storage Layer Breakdown

When we think of this software layer, it can sometimes be hard to visualize mentally, so let's break it down a bit more:

1. Backend: A cloud object store holding parquet files and transaction logs. You have the flexibility to choose the backend storage system, as it’s an open-source storage layer that sits on top of your data lake rather than being tied to a specific storage provider. S3, Azure, the world is your oyster. This is why you'll see a lot of companies build their own unique solutions.

2. Files: Parquet format. What's brilliant about Parquet is that it is incredibly efficient to store data in this format. There are a lot of reasons, the main ones being that Parquet uses columnar storage and does advanced data compression. Now don't panic! Columnar storage sounds fancy, but really all it is saying is that data is stored column by column, which enables cool things like improved compression (because values within a column usually share the same data type) and more efficient disk reads - I don't have to read a bunch of data within a row to get to a particular column, I can just directly grab that column. Columnar storage sacrifices row-level efficiency for column-level efficiency. It’s optimized for queries like "average age across all rows" (one column read) rather than "everything about John Doe" (all columns for one row).

3. Transaction Logs - this is pretty simple to be honest. The log entries are JSON files, and the data can be periodically checkpointed using Parquet. Databricks actually calls it the DeltaLog (https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html). There is a small layout sketch after this list showing how the log and data files sit together on disk.

4. Now the actual computation and querying layer, which is handled through Apache Spark. When I say computation and querying layer, simply imagine that any code that is interacting with the tables (from reading, to writing, to updating, etc.) is being handled by this layer. It is important to note that Apache Spark is NOT a programming language. It is simply an analytics ENGINE that will perform complex computations efficiently. The newer Delta Kernel now supports other engines like Trino, Flink, Presto, Hive, DuckDB, etc., but Spark remains the main and default supported engine. Let's dig a little bit into engines to understand them further and solidify this idea.

For example... imagine I have a chicken shack and I am selling 10 chickens a day. I start to scale, and I now have millions of chickens being sold with millions of transaction logs. I first need to figure out a cheap way to store them... so I decide to build myself an organized library with a unique book placement and checkout system. This we can call the Delta Lake. But now I want to analyze my transaction logs to improve the efficiency of my business. I could go through every log by hand (aka look at every book in the library), but that wouldn't really make sense. Instead, I hire a super-robot to go through every book in my library... analyze it, organize it, and give me answers to the questions I need. At its core, think of Apache Spark as the super-robot brain that organizes my library (the Delta Lake).

5. Now this hypothetical "super-robot"... (like real-world robots), won't really perform any tasks until we give it the proper commands/operations. Think of Apache Spark as your super-robot, awaiting commands and operations from the user/developer (you) for it to take action on. There is a wide range of ways to define these operations, and this list is not exhaustive. The Delta Lake API is a big one... and allows us to programmatically perform advanced operations that you'd typically do in SQL, like MERGE. Starting to see the power of this? There is also something called Spark SQL... so you can actually just use plain SQL commands to manipulate your tables if you don't want to write Python code for your manipulations. There is also the Apache Spark DataFrame API... which really gives you that Pandas DataFrame feel while interacting with Delta Tables. Brilliant!
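
To make items 2 and 3 a bit more concrete, here is roughly what a Delta table looks like on disk (the table and file names are illustrative):

    my_table/
        _delta_log/
            00000000000000000000.json                  <- first commit in the transaction log
            00000000000000000001.json                  <- next commit
            00000000000000000010.checkpoint.parquet    <- periodic checkpoint of the log
        part-00000-....snappy.parquet                  <- the actual data files, plain Parquet
        part-00001-....snappy.parquet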

Now what is super cool is that this abstraction has changed the way corporations and developers interact with data. It has allowed companies like Databricks to build unique feature sets for their customers to adapt in a strongly data driven world. Let me give you an example. What if I want to expose a subset of my historical data to a third party to access, for governance? No problem. I can use the Delta Sharing feature to expose Delta tables via a RESTful approach. Brilliant right?

Limitations of Delta Lake Approach

Now, in the paper... the authors do point out that there are some limitations to this storage layer approach:

1. Serializable transactions are limited to a single table. Serializable simply means that if I played back all the transactions in sequential order, the outcome would be the same as what the concurrent execution actually produced.

2. For streaming workloads, Delta Lake's performance is limited by the latency of the underlying cloud object store. Now to be fair, this isn't really that big of a bottleneck, especially because cloud providers like AWS have such incredible solutions for object storage with super minimal downtimes anyways. 

3. So originally, Delta Lake did not support secondary indexes beyond the min-max statistics kept for each data object - but at the time of publication, something called a Bloom filter-based index was being prototyped. Think of min-max stats as a simple way of getting the smallest and biggest values in each data file. Bloom filter indexing is a much fancier tool that basically guesses if a specific value is inside of a file, instead of looking at ranges. This matters a lot because when we are working with huge piles of data, these indexes tell Delta Lake which files to skip. It can help a lot in messy data lakes with millions of files. It is a path to achieving much more real-time queries by ignoring irrelevant data. 

I don't know what is next in this world of storage layers and file formats, but I've heard that Apache Iceberg is becoming a direct competitor to the Delta Lake format. I might write about it some other time, but what I do know is that it was born at Netflix and has much less vendor lock-in - which is a common complaint about Delta Lake's tight integration with Spark.