December 4, 2023
You Don't Need a Vector Database
Large Language models (LLMs) currently all have a bounded context size (~16K for GPT3.5), which poses a challenge if I want to use it to perform question answering on a very large collection of documents. We simply cannot inject all the documents into the context.
One potential solution is to fine-tune the LLM on the set of documents before expecting useful responses to my prompts. However, finetuning LLM’s is difficult and costly if the set of documents is changing very quickly. For example, if I want to run Llama on my laptop so I can easily find emails, documents, etc, I certainly do not want to finetune Llama on every email received.
The alternative solution is to use Retrieval Augmented Generation (RAG), which breaks down into two stages. First, we quickly search for the small subset of relevant documents in the total set of potential documents and then we include the contents of that document subset into the prompt before sending it to the LLM
We can break down the first task into a few more precise tasks:
A popular way for identifying the relevant subset (the first task) is to first compute a vector embedding of each document, which represents the "meaning" of the document.
We expect that documents that are similar to each other are also nearby each other in high dimensional space.
We can then compute the vector embedding representation of our question and use nearest neighbor search to find the most relevant documents.
Here's a diagram by the Scriv.ai team that describes this process well:
Created and owned by scriv.ai
Vector databases (Pinecone, Milvus, etc) have risen in popularity lately as a means of storing and computing nearest neighbor on a large collection of documents. In this post, I’d like to make the case that you do not need a vector database for RAG.
The task of finding a small subset of relevant documents to answer a question is a well studied area of information retrieval (IR), a field of computer science. The solutions here predate vector databases.
The most obvious examples of systems that have implemented scalable solutions to this problem are search engines like Google, Bing, Apache Lucene, Apple Spotlight, and many others. As an industry, we’ve already created & iterated on highly scalable and highly available technologies using reverse indexes over the last few decades.
While semantic vectors are absolutely a great innovation in the field, they should be used and implemented in the context of the lessons we have learnt building scalable IR systems.
In this post we benchmark and evaluate vector embeddings against classical IR solutions at finding the correct document for answering a given question. And also consider a setting where one could use vector embeddings to improve retrieval, but still, without needing a vector database. The notebook which produces and demonstrates all of the results here in this blog post, is at https://github.com/xetdata/RagIRBench
To guide our comparison in this post, we’ll use the SQuAD dataset. The SQuAD dataset comprises of a collection of paragraphs, and questions for each paragraph, resulting in about 87k questions over 19k paragraphs. Each question is meant to answered only by information only found in the paragraph.
Here’s an example paragraph:
Beyoncé Giselle Knowles was born in Houston, Texas, to Celestine Ann "Tina" Knowles (née Beyincé), a hairdresser and salon owner, and Mathew Knowles, a Xerox sales manager. Beyoncé\'s name is a tribute to her mother\'s maiden name. Beyoncé\'s younger sister Solange is also a singer and a former member of Destiny\'s Child. Mathew is African-American, while Tina is of Louisiana Creole descent (with African, Native American, French, Cajun, and distant Irish and Spanish ancestry). Through her mother, Beyoncé is a descendant of Acadian leader Joseph Broussard. She was raised in a Methodist household.
And a few questions that can be answered from this paragraph are:
What race was Beyonce's father?
Beyoncé was raised in what religion?
What is the name of Beyoncé's younger sister?
However, instead of answering the questions using the provided paragraph, we are going to invert the dataset: given the question, find the paragraph containing the answer.
The dataset is not perfect for this purpose. For instance, there are questions such as "In what R&B group was she the lead singer?" which assume that there is sufficient context to disambiguate the pronoun "she". However, the vast majority of questions do not have this issue, and so the dataset suffices for our experiment here.
Vector Based Retrieval With OpenAI Vectors
Let’s start by evaluating retrieval performance using OpenAI’s embedding vector service.
First, we compute an embedding for every document and question. Then, for each question, we find the top 20 nearest neighbors. Note that we use exact nearest neighbor here as opposed to the approximate methods used by large vector databases, hence retrieval accuracy is not a concern.
In the following chat, we plot the recall at N. (if I return the N nearest paragraphs, how often is the correct paragraph in the set).
This means that if I only look at the top-1 nearest neighbor, I will find the paragraph with the answer ~63% of the time. This increases to 88% with the top 10 documents, and 91.7% for the top 20 documents. This seems pretty good!
Classical BM25 Method
Now, lets compare this with an old "classical" method that pretty much only works with simple word frequencies: BM25. This is basically tf-idf with some additional normalization tricks and see how it performs! We use Spacy to perform tokenization and do some basic normalization (remove punctuation, and lower-case everything).
BM25 is worse than OpenAI embeddings (though interestingly it appears better at top 1). Which is excellent! OpenAI embeddings have value! That said, it is not much worse. Let’s consider how we would evaluate RAG: i.e. to begin with the end metric in mind. If we are targetting a retrieval recall rate of 85% and were to use a vector database, I would need to fetch 7 documents. If I were to use BM25 only, I would need to fetch 8 documents. The real practical difference here is insignificant, considering the cost of maintaining a vector database as well as an embedding service.
Reranking BM25 with Vectors
However, we can do much better.
Information Retrieval is classically implemented to support IR on billions of documents. The standard solutions begin with very efficient, low precision, high recall algorithms that can quickly sift through billions of documents and reduce it to a smaller subset of ~1000 docs. After which more expensive, higher precision algorithms are used to rerank and reduce the document set to find a small set of documents which are truly relevant.
We can use the same principle here. BM25 is a generally inexpensive method that can be efficiently implemented with reverse index solutions to rapidly sift down to a moderately sized dataset. After which, vector embeddings are the higher precision approach which we can use to rerank the dataset.
Using this intuition, we combine the two methods. Use BM25 to extract top 50 results, then use vector embeddings to rerank them.
And this simply cleanly outperform everything at all retrieval counts. At 85% recall rate, only 5 documents are required.
You do not need document embeddings for RAG. Consider simpler information retrieval methods. Vector embeddings are still useful, and are a great way to add a higher precision filter. But to use vector embeddings as a later stage filter, all you really need is a vector retrieval service (any KV store) to be able to do efficient and exact k-NN reranking on a small set of vectors. No vector database necessary.
Reproduce my Results
You can find the code in a Jupyter Notebook supporting all of the results here in this blog post in this GitHub repo. The repo is a Xet managed repo and requires our tiny git xet extension to clone the repo so that non-code assets are also cloned.
The notebook takes a while to run, all intermediate results are memo-ized and are stored as part of the repo for convenience. You can do a lazy clone which will delay the download of the memoized data until it is needed: