What are inverted indices?

Back when Laith and I were building the Nera search engine, we dove into a level of technicality that neither of us had seen in our years of college. There were many backend pieces to the puzzle, and LLMs were definitely involved. But we found that to build a great search experience, you usually want to choose technologies that complement LLMs; they aren't a one-and-done solution on their own.

One of these puzzle pieces was the "inverted index". This is a data structure that maps each unique term—think words or phrases—to the specific documents (and often positions within those documents) where it appears. By organizing data this way, we could dramatically speed up full-text searches, making it possible to quickly retrieve relevant results from a massive collection of documents.
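
To make that concrete, here's a toy version of the mapping in Python (the terms, document IDs, and positions are made up purely for illustration):

```python
# A tiny inverted index: each term maps to the documents (and word positions)
# where it appears. Everything here is illustrative, not real Nera data.
inverted_index = {
    "museum": {"doc1": [4], "doc3": [0, 12]},  # word 4 of doc1, words 0 and 12 of doc3
    "hiking": {"doc2": [7]},
    "waco":   {"doc1": [0], "doc2": [3], "doc3": [5]},
}

# Answering a query for "museum" is just a dictionary lookup,
# with no need to scan every document.
print(inverted_index["museum"])  # {'doc1': [4], 'doc3': [0, 12]}
```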

We started by scraping the internet, focusing primarily on local websites for cities like Waco to gather information about things to do. We collected this raw data—think event listings, attraction descriptions, or activity guides—and converted it into individual documents. Then, to prepare the text for indexing, we ran it through a stopword filter, stripping out common words like 'the,' 'and,' or 'is' that don’t carry much meaning for search purposes. 
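
As a rough sketch, that preprocessing step looks something like this (the stopword list is heavily abbreviated and the function is just for illustration, not our exact code):

```python
import re

# A tiny, illustrative stopword list; real lists run into the hundreds of words.
STOPWORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "for", "or"}

def preprocess(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Things to do in Waco: visit the museum and go hiking"))
# ['things', 'do', 'waco', 'visit', 'museum', 'go', 'hiking']
```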

Next, for each unique term left after filtering (say, 'museum' or 'hiking' from a Waco activities page), the inverted index records where that term appears across all our documents. We implemented it as a hash table, with the filtered terms as keys and, for values, lists of the document IDs and positions where each term occurs. Here's a basic graphic:
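
And if you'd rather read code than a diagram, here's a minimal sketch of how such an index can be built, reusing the kind of filtering shown above (again, illustrative rather than our production implementation):

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "for", "or"}

def build_inverted_index(documents: dict[str, str]) -> dict:
    """Map each filtered term to the doc IDs and word positions where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        terms = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]
        for position, term in enumerate(terms):
            index[term][doc_id].append(position)
    return index

# Two made-up Waco documents stand in for the scraped pages.
docs = {
    "waco-events": "Things to do in Waco: visit the museum and go hiking",
    "waco-parks": "Hiking trails and scenic parks near Waco",
}
index = build_inverted_index(docs)
print(dict(index["hiking"]))  # {'waco-events': [6], 'waco-parks': [0]}
print(dict(index["waco"]))    # {'waco-events': [2], 'waco-parks': [5]}
```

With that structure in place, answering a query is mostly a matter of dictionary lookups plus whatever ranking you layer on top.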

And that's it! Cool, right?